
Evolving Code with A Large Language Model (2401.07102v1)

Published 13 Jan 2024 in cs.NE and cs.AI

Abstract: Algorithms that use LLMs to evolve code arrived on the Genetic Programming (GP) scene very recently. We present LLM GP, a formalized LLM-based evolutionary algorithm designed to evolve code. Like GP, it uses evolutionary operators, but its designs and implementations of those operators radically differ from GP's because they enlist an LLM, using prompting and the LLM's pre-trained pattern matching and sequence completion capability. We also present a demonstration-level variant of LLM GP and share its code. By addressing algorithms that range from the formal to hands-on, we cover design and LLM-usage considerations as well as the scientific challenges that arise when using an LLM for genetic programming.
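The abstract describes the overall shape of LLM GP: a standard generational evolutionary loop in which the variation operators (mutation, crossover) are realized by prompting an LLM rather than by manipulating program trees. The sketch below is an illustrative assumption of that structure only, not the authors' released demonstration code; the names (llm_mutate, llm_crossover, evolve), the prompt wordings, and the truncation-selection scheme are hypothetical placeholders, and the LLM itself is abstracted as any prompt-to-text callable.

```python
# Minimal sketch of an LLM-driven evolutionary loop in the spirit of LLM GP.
# Assumption: the LLM is any text-completion function (prompt -> completion);
# prompts, operator names, and selection scheme are illustrative, not the paper's.

import random
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # any prompt -> completion function


def llm_mutate(llm: LLM, parent: str) -> str:
    """Ask the LLM for a slightly modified variant of a parent program."""
    prompt = (
        "Here is a Python function:\n"
        f"{parent}\n"
        "Rewrite it with one small change, returning only the code."
    )
    return llm(prompt)


def llm_crossover(llm: LLM, parent_a: str, parent_b: str) -> str:
    """Ask the LLM to combine ideas from two parent programs."""
    prompt = (
        "Combine the useful parts of these two Python functions into one:\n"
        f"# Parent A\n{parent_a}\n# Parent B\n{parent_b}\n"
        "Return only the merged function."
    )
    return llm(prompt)


def evolve(
    llm: LLM,
    seed_programs: List[str],       # needs at least two seeds for crossover
    fitness: Callable[[str], float],
    generations: int = 10,
    pop_size: int = 20,
    crossover_rate: float = 0.5,
) -> Tuple[str, float]:
    """Generational loop: evaluate, select parents, vary via LLM prompts."""
    population = list(seed_programs)
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: max(2, pop_size // 4)]  # truncation selection
        children: List[str] = []
        while len(children) < pop_size:
            if random.random() < crossover_rate:
                a, b = random.sample(parents, 2)
                children.append(llm_crossover(llm, a, b))
            else:
                children.append(llm_mutate(llm, random.choice(parents)))
        population = children
    best = max(population, key=fitness)
    return best, fitness(best)
```

In practice the fitness function would execute each candidate program against test cases in a sandbox and the LLM outputs would need parsing and validation (the paper notes such LLM-usage considerations); those details are omitted here to keep the loop structure visible.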

[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. 
[2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. 
[2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. 
[2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. 
[2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. 
[2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. 
[2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. 
arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. 
[2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. 
[2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? 
arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. 
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. 
https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. 
[2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. 
[2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 
489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . 
https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. 
[2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. 
  2. Bradley, H., Fan, H., Galanos, T., Zhou, R., Scott, D., Lehman, J.: The openelm library: Leveraging progress in language models for novel evolutionary algorithms. In: Genetic Programming Theory and Practice XX. Springer, ??? (2024) Chen et al. [2023] Chen, A., Dohan, D.M., So, D.R.: Evoprompting: Language models for code-level neural architecture search. arXiv preprint arXiv:2302.14838 (2023) Liventsev et al. [2023] Liventsev, V., Grishina, A., Härmä, A., Moonen, L.: Fully autonomous programming with large language models. arXiv preprint arXiv:2304.10423 (2023) O’Neill et al. [2010] O’Neill, M., Vanneschi, L., Gustafson, S., Banzhaf, W.: Open issues in genetic programming. Genetic Programming and Evolvable Machines 11, 339–363 (2010) O’Neill and Spector [2020] O’Neill, M., Spector, L.: Automatic programming: The open issue? Genetic Programming and Evolvable Machines 21, 251–262 (2020) Liu et al. [2023] Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55(9), 1–35 (2023) Radford et al. [2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. 
arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, A., Dohan, D.M., So, D.R.: Evoprompting: Language models for code-level neural architecture search. arXiv preprint arXiv:2302.14838 (2023) Liventsev et al. [2023] Liventsev, V., Grishina, A., Härmä, A., Moonen, L.: Fully autonomous programming with large language models. arXiv preprint arXiv:2304.10423 (2023) O’Neill et al. [2010] O’Neill, M., Vanneschi, L., Gustafson, S., Banzhaf, W.: Open issues in genetic programming. Genetic Programming and Evolvable Machines 11, 339–363 (2010) O’Neill and Spector [2020] O’Neill, M., Spector, L.: Automatic programming: The open issue? Genetic Programming and Evolvable Machines 21, 251–262 (2020) Liu et al. [2023] Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55(9), 1–35 (2023) Radford et al. [2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. 
[2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. 
[2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Liventsev, V., Grishina, A., Härmä, A., Moonen, L.: Fully autonomous programming with large language models. arXiv preprint arXiv:2304.10423 (2023) O’Neill et al. [2010] O’Neill, M., Vanneschi, L., Gustafson, S., Banzhaf, W.: Open issues in genetic programming. Genetic Programming and Evolvable Machines 11, 339–363 (2010) O’Neill and Spector [2020] O’Neill, M., Spector, L.: Automatic programming: The open issue? Genetic Programming and Evolvable Machines 21, 251–262 (2020) Liu et al. [2023] Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55(9), 1–35 (2023) Radford et al. [2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. 
Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) O’Neill, M., Vanneschi, L., Gustafson, S., Banzhaf, W.: Open issues in genetic programming. Genetic Programming and Evolvable Machines 11, 339–363 (2010) O’Neill and Spector [2020] O’Neill, M., Spector, L.: Automatic programming: The open issue? Genetic Programming and Evolvable Machines 21, 251–262 (2020) Liu et al. [2023] Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. 
ACM Computing Surveys 55(9), 1–35 (2023) Radford et al. [2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. 
[2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) O’Neill, M., Spector, L.: Automatic programming: The open issue? Genetic Programming and Evolvable Machines 21, 251–262 (2020) Liu et al. [2023] Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55(9), 1–35 (2023) Radford et al. [2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. 
[2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55(9), 1–35 (2023) Radford et al. [2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. 
[2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. 
[2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. 
arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. 
[2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. 
[2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. 
arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. 
arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. 
[2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. 
[2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. 
[2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) O’Neill, M., Vanneschi, L., Gustafson, S., Banzhaf, W.: Open issues in genetic programming. Genetic Programming and Evolvable Machines 11, 339–363 (2010) O’Neill and Spector [2020] O’Neill, M., Spector, L.: Automatic programming: The open issue? Genetic Programming and Evolvable Machines 21, 251–262 (2020) Liu et al. [2023] Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55(9), 1–35 (2023) Radford et al. [2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. 
[2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) O’Neill, M., Spector, L.: Automatic programming: The open issue? Genetic Programming and Evolvable Machines 21, 251–262 (2020) Liu et al. [2023] Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55(9), 1–35 (2023) Radford et al. [2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. 
OpenAI blog 1(8), 9 (2019) Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? 
arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55(9), 1–35 (2023) Radford et al. [2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. 
[2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. 
[2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 
55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. 
In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. 
[2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. 
arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. 
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. 
[2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. 
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. 
arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. 
[2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. 
arXiv preprint arXiv:2308.12950 (2023)
[34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27
Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023)
Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023)
Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022)
Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023)
Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931 (2023)
Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: LLMatic: Neural architecture search via large language models and quality-diversity optimization. arXiv preprint arXiv:2306.01102 (2023)
Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023)
Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023)
Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023)
Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: ChatGPT and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023)
Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023)
Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to PSB2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022)
Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023)
Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167, https://aclanthology.org/2022.naacl-main.167
Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. 
[2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. 
[2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. 
[2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. 
[2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. 
[2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. 
[2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. 
arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. 
[2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. 
[2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? 
arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. 
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. 
https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. 
[2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. 
[2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 
489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . 
https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. 
[2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. 
  5. O’Neill, M., Vanneschi, L., Gustafson, S., Banzhaf, W.: Open issues in genetic programming. Genetic Programming and Evolvable Machines 11, 339–363 (2010) O’Neill and Spector [2020] O’Neill, M., Spector, L.: Automatic programming: The open issue? Genetic Programming and Evolvable Machines 21, 251–262 (2020) Liu et al. [2023] Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55(9), 1–35 (2023) Radford et al. [2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. 
arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) O’Neill, M., Spector, L.: Automatic programming: The open issue? Genetic Programming and Evolvable Machines 21, 251–262 (2020) Liu et al. [2023] Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55(9), 1–35 (2023) Radford et al. [2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 
13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. 
https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55(9), 1–35 (2023) Radford et al. [2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. 
arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. 
[2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. 
[2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. 
[2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. 
[2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. 
[2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023)
Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to PSB2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022)
Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023)
Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167. https://aclanthology.org/2022.naacl-main.167
Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022)
Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023). https://doi.org/10.1145/3571730
Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020)
Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021)
Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable AI: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022)
Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022)
Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022)
Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022)
Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022)
Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023)
Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023)
Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023)
Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020)
Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative AI has an intellectual property problem. Harvard Business Review, April 07, 2023 (2023)
Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is ChatGPT's behavior changing over time? arXiv preprint arXiv:2307.09009 (2023)
Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023)
Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: LLMs trained on "A is B" fail to learn "B is A". arXiv preprint arXiv:2309.12288 (2023)
Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023)
Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models' capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023)
[31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27
[32] Connectionists: Chomsky's apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27
Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023)
[34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27
Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023)
Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023)
Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022)
Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023)
Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931 (2023)
Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: LLMatic: Neural architecture search via large language models and quality-diversity optimization. arXiv preprint arXiv:2306.01102 (2023)
Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023)
Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023)
Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023)
Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: ChatGPT and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023)
arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. 
arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. 
[2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. 
[2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. 
[2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. 
Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. 
arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
  6. O’Neill, M., Spector, L.: Automatic programming: The open issue? Genetic Programming and Evolvable Machines 21, 251–262 (2020) Liu et al. [2023] Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55(9), 1–35 (2023) Radford et al. [2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. 
arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55(9), 1–35 (2023) Radford et al. [2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. 
[2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. 
[2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. 
[2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. 
arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. 
arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. 
https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. 
[2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". 
arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. 
[2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. 
https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
  7. Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55(9), 1–35 (2023) Radford et al. [2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. 
[2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. 
[2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. 
arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. 
[2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. 
[2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. 
arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. 
arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. 
[2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. 
[2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. 
[2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. 
Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. 
https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. 
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. 
arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. 
[2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. 
[2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. 
In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. 
[2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. 
[2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. 
arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. 
https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. 
[2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". 
arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. 
[2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. 
https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. 
Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. 
[2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. 
[2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. 
[2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. 
[2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. 
[2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. 
[2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. 
[2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . 
https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. 
Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
  10. OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. 
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. 
april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. 
arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. 
[2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. 
Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. 
arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . 
https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. 
[2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. 
ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. 
[2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. 
https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
  12. Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. 
[2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. 
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. 
[2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. 
arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. 
arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. 
[2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. 
[2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. 
arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. 
[2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. 
arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. 
arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. 
[2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. 
arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. 
arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. 
[2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. 
arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. 
arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. 
[2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. 
arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. 
Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
  15. Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. 
Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. 
[2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. 
[2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. 
[2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. 
https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. 
arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. 
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. 
Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
  16. Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. 
[2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. 
[2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. 
[2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. 
[2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. 
[2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. 
[2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. 
Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022)
Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022)
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022)
Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023)
Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023)
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023)
Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020)
Appel, G., Neelbauer, J., Schweidel, D.: Generative AI has an intellectual property problem. Harvard Business Review, April 07, 2023 (2023)
Chen, L., Zaharia, M., Zou, J.: How is ChatGPT's behavior changing over time? arXiv preprint arXiv:2307.09009 (2023)
Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023)
Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: LLMs trained on "A is B" fail to learn "B is A". arXiv preprint arXiv:2309.12288 (2023)
Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023)
Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models' capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023)
[31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27
[32] Connectionists: Chomsky's apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27
Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023)
[34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27
Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023)
Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023)
Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022)
Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023)
Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931 (2023)
Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: LLMatic: Neural architecture search via large language models and quality-diversity optimization. arXiv preprint arXiv:2306.01102 (2023)
Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023)
Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023)
Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023)
Lanzi, P.L., Loiacono, D.: ChatGPT and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023)
Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023)
Helmuth, T., Kelly, P.: Applying genetic programming to PSB2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022)
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023)
Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167. https://aclanthology.org/2022.naacl-main.167
Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
[2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. 
[2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. 
[2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. 
[2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. 
[2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. 
https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. 
[2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. 
ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. 
[2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. 
https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022)
Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023)
Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023)
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023)
Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020)
Appel, G., Neelbauer, J., Schweidel, D.: Generative AI has an intellectual property problem. Harvard Business Review, April 07, 2023 (2023)
Chen, L., Zaharia, M., Zou, J.: How is ChatGPT's behavior changing over time? arXiv preprint arXiv:2307.09009 (2023)
Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023)
Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: LLMs trained on "A is B" fail to learn "B is A". arXiv preprint arXiv:2309.12288 (2023)
Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023)
Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models' capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023)
On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27
Connectionists: Chomsky's apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27
Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023)
Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27
Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023)
Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023)
Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022)
Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023)
Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931 (2023)
Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: LLMatic: Neural architecture search via large language models and quality-diversity optimization. arXiv preprint arXiv:2306.01102 (2023)
Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023)
Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023)
Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023)
Lanzi, P.L., Loiacono, D.: ChatGPT and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023)
Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023)
Helmuth, T., Kelly, P.: Applying genetic programming to PSB2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022)
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023)
Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167. https://aclanthology.org/2022.naacl-main.167
Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. 
[2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. 
[2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. 
[2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. 
arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
  20. Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. 
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. 
[2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. 
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. 
arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023)
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of Thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023)
Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020)
Appel, G., Neelbauer, J., Schweidel, D.: Generative AI has an intellectual property problem. Harvard Business Review, April 7, 2023
Chen, L., Zaharia, M., Zou, J.: How is ChatGPT's behavior changing over time? arXiv preprint arXiv:2307.09009 (2023)
Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023)
Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A". arXiv preprint arXiv:2309.12288 (2023)
Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023)
Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models' capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023)
On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27
Connectionists: Chomsky's apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27
Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023)
Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27
Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023)
Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023)
Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022)
Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023)
Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931 (2023)
Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: LLMatic: Neural architecture search via large language models and quality-diversity optimization. arXiv preprint arXiv:2306.01102 (2023)
Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023)
Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023)
Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023)
Lanzi, P.L., Loiacono, D.: ChatGPT and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023)
Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023)
Helmuth, T., Kelly, P.: Applying genetic programming to PSB2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022)
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023)
Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167. https://aclanthology.org/2022.naacl-main.167
Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. 
[2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. 
https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167. https://aclanthology.org/2022.naacl-main.167
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023)
Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020)
Appel, G., Neelbauer, J., Schweidel, D.: Generative AI has an intellectual property problem. Harvard Business Review, April 7, 2023 (2023)
Chen, L., Zaharia, M., Zou, J.: How is ChatGPT's behavior changing over time? arXiv preprint arXiv:2307.09009 (2023)
Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023)
Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: LLMs trained on "A is B" fail to learn "B is A". arXiv preprint arXiv:2309.12288 (2023)
Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023)
Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models' capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023)
On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27
Connectionists: Chomsky's apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27
Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023)
Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27
Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023)
Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023)
Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022)
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . 
https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. 
Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020)
Appel, G., Neelbauer, J., Schweidel, D.: Generative AI has an intellectual property problem. Harvard Business Review, April 7, 2023 (2023)
Chen, L., Zaharia, M., Zou, J.: How is ChatGPT's behavior changing over time? arXiv preprint arXiv:2307.09009 (2023)
Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023)
Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: LLMs trained on "A is B" fail to learn "B is A". arXiv preprint arXiv:2309.12288 (2023)
Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023)
Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models' capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023)
On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27
Connectionists: Chomsky's apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27
Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023)
Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27
Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023)
Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023)
Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022)
Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023)
Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931 (2023)
Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: LLMatic: Neural architecture search via large language models and quality-diversity optimization. arXiv preprint arXiv:2306.01102 (2023)
Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023)
Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023)
Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023)
Lanzi, P.L., Loiacono, D.: ChatGPT and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023)
Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023)
Helmuth, T., Kelly, P.: Applying genetic programming to PSB2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022)
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023)
Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167. https://aclanthology.org/2022.naacl-main.167
Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
  24. Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. 
[2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. 
https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
  26. Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. 
[2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. 
arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . 
https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. 
Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
  28. Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. 
arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. 
Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27
Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27
Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023)
Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27
Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023)
Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023)
Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022)
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
  32. Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023)
  33. Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27
  34. Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023)
  35. Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023)
  36. Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022)
  37. Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023)
  38. Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931 (2023)
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
  39. Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. 
Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
  40. Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
  41. Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
  42. Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
  43. Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
  44. Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023)
  45. Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022)
  46. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023)
  47. Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167. https://aclanthology.org/2022.naacl-main.167
  48. Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
Authors (3)
  1. Erik Hemberg (27 papers)
  2. Stephen Moskal (6 papers)
  3. Una-May O'Reilly (43 papers)
Citations (14)

Summary

  • The paper introduces the LLM_GP framework, which uses an LLM to implement the evolutionary operators that evolve code, in place of the operators of traditional genetic programming.
  • It presents a simplified, demonstration-level variant complete with source code, enabling researchers and practitioners to explore and evaluate the approach.
  • The study highlights challenges such as prompt engineering complexities, data biases, and the inherent unpredictability of LLM outputs.

Introduction to LLM-Based Evolutionary Algorithms

Evolutionary algorithms (EAs) have long drawn inspiration from natural evolution to optimize solutions to complex problems. Integrating LLMs into this process, however, is a comparatively new frontier, and it is in this context that LLM_GP emerges: a formalized LLM-based evolutionary algorithm whose distinctive capability is evolving code.

The LLM_GP Framework

The LLM_GP system distinguishes itself from traditional genetic programming (GP) in how it employs evolutionary operators. In LLM_GP, these operators do not manipulate code structures directly. Instead, they leverage the pre-trained capabilities of LLMs, through tailored prompts, to initialize candidate solutions, select the fittest, and introduce variation via mutation and recombination. This differs fundamentally from traditional GP, which operates directly on symbolic expressions or parse trees.
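
To make this concrete, here is a minimal sketch, in Python, of what prompt-driven variation operators might look like. The prompt wording, function names, and the injected query_llm callable are illustrative assumptions for exposition, not the authors' released implementation.

```python
from typing import Callable

def llm_mutate(parent_code: str, query_llm: Callable[[str], str]) -> str:
    """Prompt-driven mutation: ask the model for a small variation of one parent."""
    prompt = (
        "Here is a Python function:\n\n"
        f"{parent_code}\n\n"
        "Rewrite it with one small change to its logic, keeping the same signature. "
        "Return only the code."
    )
    return query_llm(prompt)

def llm_crossover(parent_a: str, parent_b: str,
                  query_llm: Callable[[str], str]) -> str:
    """Prompt-driven recombination: ask the model to blend two parent programs."""
    prompt = (
        "Combine ideas from these two Python functions into one new function "
        "with the same signature. Return only the code.\n\n"
        f"Parent A:\n{parent_a}\n\nParent B:\n{parent_b}"
    )
    return query_llm(prompt)
```

Here query_llm stands in for whatever completion endpoint a particular implementation uses; each operator is simply a prompt template plus a call to the model.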

To facilitate understanding, the authors have also provided a simplified variant of LLM_GP, complete with source code, aimed at demystifying the process for researchers and practitioners eager to explore this approach.

LLMs in Evolutionary Computing

LLMs are well suited to natural language processing tasks thanks to their training on vast quantities of text. They complete text sequences by matching patterns learned from their training data, and these capabilities are the cornerstone on which LLM_GP operates. Their proficiency in generating blocks of code and their pre-trained knowledge of code patterns allow LLMs to function effectively as substitutes for genetic operators within LLM_GP algorithms.
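
Under the same assumptions as the sketch above, population initialization can also be phrased as prompting: the task description is sent to the model repeatedly, and sampling variability supplies the diversity of the starting population. The llm_initialize helper and prompt text below are hypothetical.

```python
from typing import Callable

def llm_initialize(task_description: str, pop_size: int,
                   query_llm: Callable[[str], str]) -> list[str]:
    """Prompt-driven initialization: seed the population with candidate programs."""
    prompt = (
        "Write a Python function that solves the following task. "
        f"Return only the code.\n\n{task_description}"
    )
    # Repeated queries yield different candidates when the model samples
    # with a nonzero temperature.
    return [query_llm(prompt) for _ in range(pop_size)]
```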

Current Landscape and Challenges

While LLM_GP holds promise, it is not without challenges. The intricacy and cost of pre-training or querying an LLM, and the necessity of careful prompt engineering, are just a few of the barriers to entry. Moreover, LLMs suffer from data biases, hallucinations (the generation of incorrect or nonsensical content), and the general unpredictability associated with their generative nature.
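
One common mitigation, sketched below rather than taken from the paper, is to guard fitness evaluation so that syntactically invalid or hallucinated candidates simply receive the worst possible score. The sketch assumes each candidate is expected to define a function named solve and that test cases are given as (arguments, expected output) pairs; a production system would additionally sandbox execution and enforce timeouts.

```python
from typing import Any, Iterable, Tuple

def guarded_fitness(candidate_code: str,
                    test_cases: Iterable[Tuple[tuple, Any]],
                    entry_point: str = "solve") -> float:
    """Score a generated candidate, giving invalid output the worst fitness."""
    try:
        namespace: dict = {}
        # Compile and execute the candidate; syntax errors are caught below.
        exec(compile(candidate_code, "<candidate>", "exec"), namespace)
        fn = namespace.get(entry_point)
        if not callable(fn):
            return float("-inf")  # model did not produce the expected function
        cases = list(test_cases)
        passed = sum(1 for args, expected in cases if fn(*args) == expected)
        return passed / len(cases)
    except Exception:
        return float("-inf")  # malformed code or runtime failure
```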

Despite these hurdles, the potential of LLM_GP to evolve more efficient, and perhaps genuinely novel, code cannot be ignored. The interplay between evolutionary computation principles and LLMs may yet unlock new levels of problem-solving capability. Going forward, it will be vital to engage rigorously with the nuanced mechanics of LLMs to maximize the effectiveness and scientific validity of LLM_GP implementations.

In conclusion, LLM_GP represents a bold step toward evolving code using the pattern recognition and sequence completion capabilities inherent to LLMs. Although the approach is nascent, with considerable challenges to navigate, it highlights the exciting crossroads of evolutionary algorithms and advanced LLMs, opening the door to new methods of program synthesis.
