
Correct and Optimal: the Regular Expression Inference Challenge (2308.07899v2)

Published 15 Aug 2023 in cs.LG, cs.CL, and cs.FL

Abstract: We propose regular expression inference (REI) as a challenge for code/language modelling, and the wider machine learning community. REI is a supervised ML and program optimisation task, and poses the problem of finding minimal regular expressions from examples: Given two finite sets of strings $P$ and $N$ and a cost function $cost(\cdot)$, the task is to generate an expression $r$ that accepts all strings in $P$ and rejects all strings in $N$, while no other such expression $r'$ exists with $cost(r')<cost(r)$. REI has advantages as a challenge problem: (i) regular expressions are well-known, widely used, and a natural idealisation of code; (ii) REI's asymptotic worst-case complexity is well understood; (iii) REI has a small number of easy-to-understand parameters (e.g. $P$ or $N$ cardinality, string lengths of examples, or the cost function); this lets us easily finetune REI-hardness; (iv) REI, with its emphasis on optimisation, is an unsolved problem for deep-learning-based ML. Recently, an REI solver was implemented on GPUs, using program synthesis techniques. This enabled, for the first time, fast generation of minimal regular expressions for complex REI instances. Building on this advance, we generate and publish the first large-scale datasets for REI, and devise and evaluate several initial heuristic and machine learning baselines. We invite the community to participate and explore ML methods that learn to solve REI problems. We believe that progress in REI directly translates to progress in code/language modelling.
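To make the task statement concrete, below is a minimal brute-force sketch in Python, not the paper's GPU-based synthesis solver. It assumes a toy binary alphabet, takes expression length as the cost function $cost(\cdot)$, and enumerates candidates in order of increasing cost, so the first expression consistent with $(P, N)$ is cost-minimal by construction. The function names (`brute_force_rei`, `solves`), the alphabet, and the length-based cost are illustrative assumptions; note also that Python's `re` syntax is richer than the formal regular expressions the paper targets.

```python
import re
from itertools import product

def solves(r, P, N):
    """Return True iff regex r accepts every string in P and rejects every string in N."""
    try:
        rx = re.compile(r)
    except re.error:
        return False  # skip syntactically invalid candidate strings
    return (all(rx.fullmatch(p) for p in P)
            and not any(rx.fullmatch(n) for n in N))

def brute_force_rei(P, N, alphabet="01", max_cost=8):
    """Enumerate candidate expressions by increasing length (length plays the
    role of cost(.) here) and return the first one consistent with (P, N);
    because candidates are tried in cost order, the result is cost-minimal."""
    symbols = alphabet + "|*()"
    for cost in range(1, max_cost + 1):
        for cand in product(symbols, repeat=cost):
            r = "".join(cand)
            if solves(r, P, N):
                return r
    return None  # no consistent expression within the cost budget

if __name__ == "__main__":
    P = {"0", "00", "000"}  # strings the expression must accept
    N = {"", "1", "01"}     # strings it must reject
    print(brute_force_rei(P, N))  # prints "00*": one or more zeros
```

Exhaustive enumeration like this is exponential in the cost bound, which is precisely why the GPU-based solver mentioned in the abstract matters for generating minimal expressions at scale; the sketch only illustrates what an REI instance and its optimal solution look like.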

Authors (4)
  1. Mojtaba Valizadeh (4 papers)
  2. Philip John Gorinski (12 papers)
  3. Ignacio Iacobacci (24 papers)
  4. Martin Berger (22 papers)
