Evolving Code with A Large Language Model (2401.07102v1)
Published 13 Jan 2024 in cs.NE and cs.AI
Abstract: Algorithms that use LLMs to evolve code arrived on the Genetic Programming (GP) scene very recently. We present LLM GP, a formalized LLM-based evolutionary algorithm designed to evolve code. Like GP, it uses evolutionary operators, but its designs and implementations of those operators radically differ from GP's because they enlist an LLM, using prompting and the LLM's pre-trained pattern matching and sequence completion capability. We also present a demonstration-level variant of LLM GP and share its code. By addressing algorithms that range from the formal to hands-on, we cover design and LLM-usage considerations as well as the scientific challenges that arise when using an LLM for genetic programming.
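At its core, LLM GP keeps the familiar GP loop (evaluation, selection, crossover, mutation) but realizes the variation operators as LLM prompts over program text rather than as syntax-tree manipulations. The Python below is a minimal illustrative sketch of that idea under stated assumptions, not the paper's formal algorithm or its released demonstration code: `query_llm`, the prompt wording, and the `solve`-based fitness harness are placeholders introduced here, and the stub LLM simply echoes a trivial program so the loop runs end to end.

```python
import random

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call (replace with a real client).
    Here it just returns a trivial program so the sketch runs without a model."""
    return "def solve(x):\n    return x"

def fitness(program: str, test_cases) -> float:
    """Placeholder fitness: fraction of test cases the candidate passes.
    Executing untrusted generated code like this is unsafe outside a sandbox."""
    passed = 0
    for args, expected in test_cases:
        try:
            scope = {}
            exec(program, scope)               # candidate must define solve(...)
            if scope["solve"](*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(test_cases)

def llm_crossover(parent_a: str, parent_b: str) -> str:
    # "Crossover" via prompting: ask the LLM to combine two parent programs.
    return query_llm(
        "Combine the best ideas of these two programs into one; return only code.\n"
        f"# Parent A\n{parent_a}\n# Parent B\n{parent_b}"
    )

def llm_mutate(program: str) -> str:
    # "Mutation" via prompting: ask the LLM for a varied or improved version.
    return query_llm(
        f"Produce an improved variant of this program; return only code.\n{program}"
    )

def evolve(seed_programs, test_cases, generations=10, pop_size=8):
    population = list(seed_programs)
    for _ in range(generations):
        ranked = sorted(population, key=lambda p: fitness(p, test_cases), reverse=True)
        parents = ranked[: max(2, pop_size // 2)]          # truncation selection
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            children.append(llm_mutate(llm_crossover(a, b)))
        population = children
    return max(population, key=lambda p: fitness(p, test_cases))

if __name__ == "__main__":
    tests = [((2,), 4), ((3,), 9)]                         # toy target: square a number
    seeds = ["def solve(x):\n    return x + x", "def solve(x):\n    return x"]
    best = evolve(seeds, tests, generations=3, pop_size=4)
    print(best, fitness(best, tests))
```

In a real run, `query_llm` would call an actual model and the fitness harness would sandbox execution; the sketch only illustrates the paper's central design point that variation is expressed through prompting and the LLM's sequence-completion ability rather than through GP's syntactic operators.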
[2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. 
[2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. 
[2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. 
[2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. 
[2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. 
[2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. 
arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. 
[2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. 
https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. 
[2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. 
[2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. 
arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. 
[2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. 
[2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. 
[2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. 
[2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
- Chen, A., Dohan, D.M., So, D.R.: Evoprompting: Language models for code-level neural architecture search. arXiv preprint arXiv:2302.14838 (2023) Liventsev et al. [2023] Liventsev, V., Grishina, A., Härmä, A., Moonen, L.: Fully autonomous programming with large language models. arXiv preprint arXiv:2304.10423 (2023) O’Neill et al. [2010] O’Neill, M., Vanneschi, L., Gustafson, S., Banzhaf, W.: Open issues in genetic programming. Genetic Programming and Evolvable Machines 11, 339–363 (2010) O’Neill and Spector [2020] O’Neill, M., Spector, L.: Automatic programming: The open issue? Genetic Programming and Evolvable Machines 21, 251–262 (2020) Liu et al. [2023] Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55(9), 1–35 (2023) Radford et al. [2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. 
[2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Liventsev, V., Grishina, A., Härmä, A., Moonen, L.: Fully autonomous programming with large language models. arXiv preprint arXiv:2304.10423 (2023) O’Neill et al. [2010] O’Neill, M., Vanneschi, L., Gustafson, S., Banzhaf, W.: Open issues in genetic programming. Genetic Programming and Evolvable Machines 11, 339–363 (2010) O’Neill and Spector [2020] O’Neill, M., Spector, L.: Automatic programming: The open issue? Genetic Programming and Evolvable Machines 21, 251–262 (2020) Liu et al. [2023] Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55(9), 1–35 (2023) Radford et al. [2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. 
[2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 
13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. 
https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. 
https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. 
Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. 
[2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. 
[2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. 
[2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. 
https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. 
arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. 
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. 
[2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. 
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. 
arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. 
[2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. 
arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. 
arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. 
[2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. 
[2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. 
[2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. 
[2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. 
[2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. 
arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. 
[2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. 
https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. 
[2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. 
[2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. 
arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. 
[2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. 
arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. 
[2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
- O’Neill, M., Spector, L.: Automatic programming: The open issue? Genetic Programming and Evolvable Machines 21, 251–262 (2020) Liu et al. [2023] Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55(9), 1–35 (2023) Radford et al. [2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. 
arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55(9), 1–35 (2023) Radford et al. [2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. 
[2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) Brown et al. 
[2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. 
[2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners (2020) OpenAI [2023] OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. 
arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. 
arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. 
https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. 
[2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". 
arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. 
[2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. 
https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. 
[2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. 
[2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. 
[2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. 
[2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. 
[2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. 
[2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. 
arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. 
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. 
arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. 
[2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. 
[2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. 
In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. 
[2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. 
[2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. 
[2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. 
[2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. 
[2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. 
[2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. 
[2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. 
[2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. 
http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. 
[2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. 
[2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. 
Xu et al.
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . 
https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. 
Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
- OpenAI: GPT-4 Technical Report (2023) Phuong and Hutter [2022] Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. 
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Phuong, M., Hutter, M.: Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022) Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. 
april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (2023) https://doi.org/10.1145/3571730 Strubell et al. [2020] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. 
arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for modern deep learning research. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13693–13696 (2020) Patterson et al. [2021] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. 
[2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. 
[2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. 
[2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. 
Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. 
arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. 
[2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. 
ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. 
[2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. 
https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. 
arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. 
[2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. 
[2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. 
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. 
[2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. 
[2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. 
http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? 
arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. 
[2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 
489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . 
https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. 
Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. 
[2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. 
[2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. 
[2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. 
https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. 
arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. 
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. 
[2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. 
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
- Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021) Wu et al. [2022] Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. 
https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. 
Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. 
[2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. 
[2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. 
[2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. 
https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. 
arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. 
Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
- Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al.: Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4, 795–813 (2022) Kaack et al. [2022] Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. 
Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. 
[2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. 
[2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. 
[2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. 
https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. 
arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. 
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. 
Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
- Kaack, L.H., Donti, P.L., Strubell, E., Kamiya, G., Creutzig, F., Rolnick, D.: Aligning artificial intelligence with climate change mitigation. Nature Climate Change 12(6), 518–527 (2022) Zhou et al. [2022] Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. 
[2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. 
[2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. 
[2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. 
[2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. 
[2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. 
Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
- Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., Sedghi, H.: Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066 (2022) Izacard et al. [2022] Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. 
[2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022) Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. 
[2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. 
[2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022) Wang et al. [2023] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. 
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023) Shao et al. [2023] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. 
[2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. 
[2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. 
http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? 
arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. 
[2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 
489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . 
https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. 
- Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022)
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022)
- Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models (2023)
- Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023)
- Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023)
- Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020)
- Appel, G., Neelbauer, J., Schweidel, D.: Generative AI has an intellectual property problem. April 07, 2023. Harvard Business Review (2023)
- Chen, L., Zaharia, M., Zou, J.: How is ChatGPT’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023)
https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., Chen, W.: Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618 (2023) Yao et al. [2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023) Raji et al. [2020] Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. 
[2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. 
ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. 
[2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. 
[2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. 
https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. 
[2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. 
[2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. 
[2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. 
arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. 
[2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. 
arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. 
Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. 
[2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. 
https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022)
- Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023)
- Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167. https://aclanthology.org/2022.naacl-main.167
- Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
- Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023)
- Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020)
- Appel, G., Neelbauer, J., Schweidel, D.: Generative AI has an intellectual property problem. Harvard Business Review, April 07, 2023 (2023)
- Chen, L., Zaharia, M., Zou, J.: How is ChatGPT's behavior changing over time? arXiv preprint arXiv:2307.09009 (2023)
- Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023)
- Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: LLMs trained on "A is B" fail to learn "B is A". arXiv preprint arXiv:2309.12288 (2023)
- Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023)
- Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models' capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023)
- [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27
- [32] Connectionists: Chomsky's apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27
- Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023)
- [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27
- Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023)
- Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023)
- Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022)
- Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023)
- Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931 (2023)
- Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: LLMatic: Neural architecture search via large language models and quality-diversity optimization. arXiv preprint arXiv:2306.01102 (2023)
- Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023)
- Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023)
- Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023)
- Lanzi, P.L., Loiacono, D.: ChatGPT and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023)
- Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023)
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . 
https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. 
Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
- Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., Denton, E.: Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020) Appel et al. [2023] Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Appel, G., Neelbauer, J., Schweidel, D.: Generative ai has an intellectual property problem. april 07, 2023. Harvard Business Review (2023) Chen et al. [2023] Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. 
[2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. 
[2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. 
[2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. 
https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
- Appel, G., Neelbauer, J., Schweidel, D.: Generative AI has an intellectual property problem. Harvard Business Review (April 7, 2023)
- Chen, L., Zaharia, M., Zou, J.: How is ChatGPT's behavior changing over time? arXiv preprint arXiv:2307.09009 (2023)
- Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023)
- Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: LLMs trained on "A is B" fail to learn "B is A". arXiv preprint arXiv:2309.12288 (2023)
- Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023)
- Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models' capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023)
- On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27
- Connectionists: Chomsky's apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27
- Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023)
- Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27
- Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023)
- Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023)
- Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022)
- Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023)
- Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931 (2023)
- Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: LLMatic: Neural architecture search via large language models and quality-diversity optimization. arXiv preprint arXiv:2306.01102 (2023)
- Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023)
- Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023)
- Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023)
- Lanzi, P.L., Loiacono, D.: ChatGPT and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023)
- Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023)
- Helmuth, T., Kelly, P.: Applying genetic programming to PSB2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022)
- Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023)
- Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167
- Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. 
arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. 
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . 
https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. 
Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
- Chen, L., Zaharia, M., Zou, J.: How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023) Du et al. [2023] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023) Berglund et al. [2023] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv preprint arXiv:2309.12288 (2023) Moskvichev et al. [2023] Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
- Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023)
- Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A.C., Korbak, T., Evans, O.: The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A". arXiv preprint arXiv:2309.12288 (2023)
- Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023)
- Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023)
- [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27
- [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27
- Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023)
- [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27
- Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023)
- Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023)
- Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022)
- Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023)
- Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931 (2023)
- Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: LLMatic: Neural architecture search via large language models and quality-diversity optimization. arXiv preprint arXiv:2306.01102 (2023)
- Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023)
- Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023)
- Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023)
- Lanzi, P.L., Loiacono, D.: ChatGPT and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023)
- Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023)
- Helmuth, T., Kelly, P.: Applying genetic programming to PSB2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022)
- Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023)
- Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167. https://aclanthology.org/2022.naacl-main.167
- Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. 
[2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. 
[2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. 
arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. 
[2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. 
Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
[2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. 
arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . 
https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. 
Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. 
Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
- Moskvichev, A., Odouard, V.V., Mitchell, M.: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain (2023) Ding et al. [2023] Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023) [31] On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. 
arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
- Ding, Z., Srinivasan, A., MacNeil, S., Chan, J.: Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In: Proceedings of the 15th Conference on Creativity and Cognition, pp. 489–505 (2023)
- On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27
- Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27
- Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023)
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. 
Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
- On Evaluating Understanding and Generalization in the ARC Domain. https://aiguide.substack.com/p/on-evaluating-understanding-and-generalization. Accessed: 2023-10-27 [32] Connectionists: Chomsky’s apple. https://mailman.srv.cs.cmu.edu/pipermail/connectionists/2023-March/039546.html. Accessed: 2023-10-27 Roziere et al. [2023] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023) [34] Preparatory Steps of Genetic Programming. http://www.genetic-programming.com/gppreparatory.html. Accessed: 2023-10-27 Ling et al. [2023] Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
[2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
- Ling, T., Chen, L., Lai, Y., Liu, H.-L.: Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification (2023) Zelikman et al. [2023] Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. 
[2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. 
[2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. 
[2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. 
arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? 
In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
- Zelikman, E., Lorch, E., Mackey, L., Kalai, A.T.: Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation (2023) Lehman et al. [2022] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: LLMatic: Neural architecture search via large language models and quality-diversity optimization. arXiv preprint arXiv:2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: ChatGPT and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to PSB2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023)
- Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: Evolution through large models. arXiv preprint arXiv:2206.08896 (2022) Meyerson et al. [2023] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. 
[2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. 
[2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
- Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman, J.: Language Model Crossover: Variation through Few-Shot Prompting (2023) Ma et al. [2023] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931 (2023) Nasir et al. [2023] Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. 
arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. ArXiv abs/2306.01102 (2023) Guo et al. [2023] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. 
[2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023) Fernando et al. [2023] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023) Xu et al. [2023] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). 
https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023) Lanzi and Loiacono [2023] Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023) Sudhakaran et al. [2023] Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) Helmuth and Kelly [2022] Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. 
[2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022) Yao et al. [2023] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023) Webson and Pavlick [2022] Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167 . https://aclanthology.org/2022.naacl-main.167 Lipkin et al. [2023] Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023) Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
- Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931 (2023)
- Nasir, M.U., Earle, S., Togelius, J., James, S.D., Cleghorn, C.W.: Llmatic: Neural architecture search via large language models and quality-diversity optimization. arXiv preprint arXiv:2306.01102 (2023)
- Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., Yang, Y.: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (2023)
- Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023)
- Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Jiang, D.: Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023)
- Lanzi, P.L., Loiacono, D.: Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155 (2023)
- Sudhakaran, S., González-Duque, M., Glanois, C., Freiberger, M., Najarro, E., Risi, S.: MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023)
- Helmuth, T., Kelly, P.: Applying genetic programming to psb2: the next generation program synthesis benchmark suite. Genetic Programming and Evolvable Machines 23(3), 375–404 (2022)
- Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models (2023)
- Webson, A., Pavlick, E.: Do prompt-based models really understand the meaning of their prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2300–2344. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.167. https://aclanthology.org/2022.naacl-main.167
- Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners (2023)
- Erik Hemberg
- Stephen Moskal
- Una-May O'Reilly