- The paper presents novel DAAC optimization techniques that reduce memory overhead and accelerate multiple pattern matching.
- It combines advanced memory layouts and traversal strategies to achieve up to 2.6× faster execution in tokenizer tests.
- The open-source tool Daachorse demonstrates practical benefits for efficient text processing in search engines and data mining.
Engineering Faster Double-Array Aho-Corasick Automata
The paper "Engineering Faster Double-Array Aho–Corasick Automata" by Shunsuke Kanda and colleagues tackles the notable challenge of optimizing Double-Array Aho-Corasick automata (DAACs) for multiple pattern matching. This problem is inherently pertinent to text processing and computational linguistics, where efficiency largely dictates the feasibility of large-scale and real-time applications. The authors present a comprehensive paper on DAAC implementation techniques to improve both time and space efficiencies, a knowledge gap that has previously impaired the practical application of this data structure in fast multiple pattern matching.
Key Contributions and Techniques
The authors enumerate and experimentally evaluate the implementation techniques critical to DAACs, identifying the combinations best suited to different scenarios. The library developed as part of this work, Daachorse, incorporates these findings and serves as an open-source solution for applications requiring efficient pattern matching. The key techniques reviewed and introduced in this work include:
- Approaches to Manage Output Sets: The paper contrasts the Simple, Shared, and Forest methods for storing output sets. The Forest method, while making access operations slightly more involved, significantly reduces memory overhead by eliminating redundancy among the stored output sets (a sketch of this representation follows the list).
- Handling Multibyte Characters: The paper investigates Bytewise, Charwise, and Mapped schemes for processing strings with multibyte characters. Notably, the Mapped scheme reduces the number of vacant identifiers by remapping code-point values to a dense range, enabling faster pattern matching with fewer runtime checks (sketched after the list).
- Memory Layout and Data-Efficient Structures: The Packed layout shows better cache performance than the Individual layout thanks to improved data locality, which substantially reduces cache misses during execution. Additionally, a Compact format with byte-sized CHECK entries improves space efficiency at little cost in speed (see the transition sketch below).
- Accelerating Vacant Searches: The paper combines techniques such as SkipForward and SkipDense to handle vacant IDs more efficiently during construction, balancing construction time against memory usage (a free-list sketch appears below).
- Traversal Strategies: LexDFS outperforms other traversal strategies in several settings because it yields better cache locality for frequently accessed deep states, demonstrating how much the choice of state-numbering order matters for minimizing computational overhead (a numbering sketch follows the list).
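To make the Forest idea concrete, here is a minimal Rust sketch, not Daachorse's actual internals: each output record stores a pattern value plus a link to the record shared through the suffix chain, so values common to many states are stored once and reached by following parent links.

```rust
// Minimal sketch of a Forest-style output store (not Daachorse's actual
// internals): each record keeps a pattern value and a link to the record
// shared through the suffix chain, so common outputs are stored only once.
#[derive(Clone, Copy)]
struct OutputNode {
    value: u32,          // pattern identifier reported on a match
    parent: Option<u32>, // next output along the suffix chain, if any
}

struct OutputForest {
    nodes: Vec<OutputNode>,
}

impl OutputForest {
    /// Collects all pattern values for a state whose output chain starts at
    /// `head`, by walking parent links through the forest.
    fn collect(&self, head: Option<u32>) -> Vec<u32> {
        let mut values = Vec::new();
        let mut cur = head;
        while let Some(i) = cur {
            let node = self.nodes[i as usize];
            values.push(node.value);
            cur = node.parent;
        }
        values
    }
}

fn main() {
    // Patterns "she" (0), "he" (1), "e" (2): the state for "she" reuses the
    // records of "he" and "e" via parent links instead of copying them.
    let forest = OutputForest {
        nodes: vec![
            OutputNode { value: 2, parent: None },    // "e"
            OutputNode { value: 1, parent: Some(0) }, // "he" -> "e"
            OutputNode { value: 0, parent: Some(1) }, // "she" -> "he"
        ],
    };
    assert_eq!(forest.collect(Some(2)), vec![0, 1, 2]);
}
```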
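The Mapped scheme can be pictured with the following sketch, where the function name and the frequency-based ordering are illustrative assumptions rather than the paper's exact construction: code points that occur in the patterns are remapped to small dense codes, so the charwise double array works over a compact alphabet with few vacant slots.

```rust
use std::collections::HashMap;

// Illustrative sketch of the Mapped idea (function name and frequency-based
// ordering are assumptions for illustration): code points occurring in the
// patterns get small dense codes, most frequent first, so the charwise
// double array works over a compact alphabet with few vacant slots.
fn build_code_map(patterns: &[&str]) -> HashMap<char, u32> {
    let mut freq: HashMap<char, usize> = HashMap::new();
    for p in patterns {
        for c in p.chars() {
            *freq.entry(c).or_insert(0) += 1;
        }
    }
    // Assign dense codes 1, 2, ... in order of decreasing frequency;
    // code 0 is reserved for characters that never occur in a pattern.
    let mut chars: Vec<(char, usize)> = freq.into_iter().collect();
    chars.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    chars
        .into_iter()
        .enumerate()
        .map(|(i, (c, _))| (c, i as u32 + 1))
        .collect()
}

fn main() {
    let map = build_code_map(&["東京", "京都", "都"]);
    let code = |c: char| map.get(&c).copied().unwrap_or(0);
    // Characters absent from the patterns map to 0 and can be rejected
    // immediately during matching.
    assert_ne!(code('京'), 0);
    assert_eq!(code('a'), 0);
}
```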
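The Packed and Compact ideas can be combined in a sketch like the one below; it uses the textbook BASE[s] + c addressing and is a conceptual model, not Daachorse's code. BASE and CHECK sit in one array of structs for locality, and CHECK holds the one-byte edge label instead of a parent index.

```rust
// Conceptual sketch of a Packed + Compact layout (not Daachorse's code):
// BASE and CHECK live in one array of structs for better locality, and
// CHECK stores the one-byte edge label instead of the parent index.
#[derive(Clone, Copy)]
struct Element {
    base: u32, // offset used to compute the child index
    check: u8, // label of the incoming edge (the Compact form of CHECK)
}

/// Returns the child reached from `state` by byte `label`, if it exists,
/// using the textbook BASE[s] + c addressing.
fn next_state(elems: &[Element], state: usize, label: u8) -> Option<usize> {
    let child = elems[state].base as usize + label as usize;
    if child < elems.len() && elems[child].check == label {
        Some(child)
    } else {
        None
    }
}

fn main() {
    // Tiny hand-built example: from state 0, byte b'a' leads to state 97
    // (BASE[0] = 0, so the child index is 0 + b'a'), and CHECK[97] = b'a'.
    let mut elems = vec![Element { base: 0, check: 0 }; 128];
    elems[b'a' as usize].check = b'a';
    assert_eq!(next_state(&elems, 0, b'a'), Some(97));
    assert_eq!(next_state(&elems, 0, b'b'), None);
}
```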
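A simple way to picture vacant-search acceleration in the spirit of SkipForward is a free list over vacant slots, sketched below under the assumption that the goal is to visit only vacant positions when searching for a usable BASE value; the construction described in the paper is more refined than this.

```rust
// Sketch of a free list over vacant slots (illustrative, not Daachorse's
// construction code): vacant elements are chained together so the search
// for a usable BASE value skips occupied positions entirely.
struct VacantList {
    next: Vec<usize>, // next[i] = index of the next vacant slot after i
    head: usize,      // first vacant slot
}

impl VacantList {
    fn new(len: usize) -> Self {
        // Initially every slot is vacant and chained in order; `len` acts
        // as the end-of-list sentinel.
        Self {
            next: (1..=len).collect(),
            head: 0,
        }
    }

    /// Iterates vacant slots in increasing order without touching occupied ones.
    fn vacant_iter(&self) -> impl Iterator<Item = usize> + '_ {
        let mut cur = self.head;
        std::iter::from_fn(move || {
            if cur >= self.next.len() {
                None
            } else {
                let i = cur;
                cur = self.next[cur];
                Some(i)
            }
        })
    }

    /// Marks slot `i` as occupied by unlinking it from the chain.
    fn occupy(&mut self, i: usize) {
        if self.head == i {
            self.head = self.next[i];
        } else {
            // Find the predecessor; a doubly-linked list would make this O(1).
            let mut p = self.head;
            while self.next[p] != i {
                p = self.next[p];
            }
            self.next[p] = self.next[i];
        }
    }
}

fn main() {
    let mut list = VacantList::new(8);
    list.occupy(0);
    list.occupy(3);
    let vacant: Vec<usize> = list.vacant_iter().collect();
    assert_eq!(vacant, vec![1, 2, 4, 5, 6, 7]);
}
```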
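Finally, the LexDFS ordering can be illustrated by the sketch below, which numbers trie states in lexicographic depth-first order so that the states along a path stay close together in memory; the trie representation here is an assumption made for illustration.

```rust
use std::collections::BTreeMap;

// Sketch of lexicographic depth-first numbering (LexDFS) of trie states
// (illustrative): visiting children in label order and numbering states as
// they are first reached keeps each path's states close together, which is
// the cache-locality argument for this traversal order.
#[derive(Default)]
struct TrieNode {
    children: BTreeMap<u8, usize>, // edge label -> child index in `nodes`
}

fn lexdfs_order(nodes: &[TrieNode], root: usize) -> Vec<usize> {
    // order[old_id] = new_id assigned in LexDFS visit order.
    let mut order = vec![usize::MAX; nodes.len()];
    let mut next_id = 0;
    let mut stack = vec![root];
    while let Some(v) = stack.pop() {
        order[v] = next_id;
        next_id += 1;
        // Push children in reverse label order so the smallest label is
        // popped (and therefore numbered) first.
        for (_, &child) in nodes[v].children.iter().rev() {
            stack.push(child);
        }
    }
    order
}

fn main() {
    // Trie for {"ab", "ac", "b"}: root(0) -a-> 1 -b-> 2, 1 -c-> 3, root -b-> 4.
    let mut nodes: Vec<TrieNode> = (0..5).map(|_| TrieNode::default()).collect();
    nodes[0].children.insert(b'a', 1);
    nodes[0].children.insert(b'b', 4);
    nodes[1].children.insert(b'b', 2);
    nodes[1].children.insert(b'c', 3);
    // LexDFS numbers the states 0, 1, 2, 3, 4 in that order for this trie.
    assert_eq!(lexdfs_order(&nodes, 0), vec![0, 1, 2, 3, 4]);
}
```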
Empirical Evaluations and Results
The extensive experimental evaluation on real-world datasets demonstrates that the proposed combination of methods outperforms existing implementations in both time and space. The results confirm that Daachorse is significantly more efficient, delivering up to 2.6× faster execution in practical tokenizers such as Vaporetto. These improvements were consistent across text corpora and application demands, illustrating the robustness of the approach on diverse linguistic datasets.
Practical and Theoretical Implications
Practically, this research applies directly to software systems that perform large-scale text processing, such as search engines and data mining systems that require rapid, memory-efficient pattern matching. Theoretically, the combination of data-structure optimization with carefully chosen traversal and search strategies represents a significant advance in string algorithms and data handling for computational linguistics.
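For readers who want to try the library itself, the following is a minimal usage sketch adapted from the daachorse crate's public examples; exact method names and the value type parameter may differ between crate versions, so the crate documentation is the authoritative reference.

```rust
// Minimal usage sketch of the daachorse crate; signatures may vary by version.
use daachorse::DoubleArrayAhoCorasick;

fn main() {
    // Build the automaton over a small pattern set.
    let patterns = vec!["bcd", "ab", "a"];
    let pma: DoubleArrayAhoCorasick<u32> =
        DoubleArrayAhoCorasick::new(patterns).unwrap();

    // Report every overlapping occurrence of the patterns in the haystack.
    for m in pma.find_overlapping_iter("abcd") {
        println!("pattern {} at {}..{}", m.value(), m.start(), m.end());
    }
}
```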
Future Work
The paper also opens avenues for further research, particularly in adapting similar optimization frameworks to compressed and succinct representations of AC automata, which could balance storage constraints against processing efficiency for applications with restrictive execution environments, such as mobile devices or embedded systems. Further context-specific tuning and applications to other pattern-matching tasks are promising directions that could extend the reach of these findings.
In summary, this work represents a comprehensive improvement of DAACs, addressing longstanding implementation inefficiencies that have limited their utility in practice. The methodologies outlined are likely to stimulate further innovation and refinement in the development of efficient text processing algorithms.