Linear Matching of JavaScript Regular Expressions (2311.17620v2)
Abstract: Modern regex languages have strayed far from well-understood traditional regular expressions: they include features that fundamentally transform the matching problem. In exchange for these features, modern regex engines at times suffer from exponential complexity blowups, a frequent source of denial-of-service vulnerabilities in JavaScript applications. Worse, regex semantics differ across languages, and the impact of these divergences on algorithmic design and worst-case matching complexity has seldom been investigated. This paper provides a novel perspective on JavaScript's regex semantics by identifying a larger-than-previously-understood subset of the language that can be matched with linear time guarantees. In the process, we discover several cases where state-of-the-art algorithms were either wrong (semantically incorrect), inefficient (suffering from superlinear complexity), or excessively restrictive (assuming certain features could not be matched linearly). We introduce novel algorithms to restore correctness and linear complexity. We further advance the state of the art in linear regex matching by presenting the first nonbacktracking algorithms for matching lookarounds in linear time: one supporting captureless lookbehinds in any regex language, and another leveraging a JavaScript property to support unrestricted lookaheads and lookbehinds. Finally, we describe new time and space complexity tradeoffs for regex engines. All of our algorithms are practical: we validated them in a prototype implementation, and some have also been merged into the V8 JavaScript implementation used in Chrome and Node.js.