Sequence Datalog: Querying Sequence Data
- Sequence Datalog is an extension of classical Datalog designed for querying sequence databases using strings, paths, or sequences instead of atomic tuples.
- It incorporates advanced features such as negation, recursion, intermediate predicates, and path concatenation to express complex queries.
- A systematic study of its six core features distinguishes between primitive and redundant constructs, guiding efficient design of sequence query engines.
Sequence Datalog is an extension of the classical Datalog language designed for querying sequence databases where the atomic units of information are strings, paths, or sequences, rather than tuples of atomic values. By enriching Datalog with path expressions built through concatenation and supporting features such as negation, recursion, intermediate predicates, higher arity relations, equations, and packing, Sequence Datalog provides a uniform logic-programming framework capable of expressing complex queries over sequences. The expressiveness of this language has been precisely characterized through a systematic study of its six core features, yielding a refined hierarchy of language fragments and clarifying their interplay and redundancy (Aamer et al., 2022).
1. Formal Syntax and Semantic Foundations
Let $\Sigma$ denote a countable set of atomic symbols. The definitions are as follows:
- Values and Paths: Each $a \in \Sigma$ is a value; if $v$ is a value, then $\langle v \rangle$ is a packed value. A path is any finite concatenation of values, including the empty path $\epsilon$.
- Variables: Atomic variables (notated @x) range over atomic symbols; path variables (notated $\$x$) range over paths.
- Path Expressions and Atoms: Path expressions are given by $e ::= a \mid \text{@x} \mid \$x \mid \langle e \rangle \mid e_1 \cdot e_2$. If $R$ is a relation name of arity $n$ and $e_1,\ldots,e_n$ are path expressions, then $R(e_1,\ldots,e_n)$ is a predicate atom; $e_1 = e_2$ denotes an equation.
- Literals and Rules: Literals are (possibly negated) atomic predicates or equations. Rules take the form $H \leftarrow B$, where $B$ is a set of literals and $H$ is a predicate atom. A program is a finite sequence of stratified rule sets ("strata"), allowing stratified negation.
- Semantics: An instance $I$ assigns to each relation name $R$ a finite relation over paths. Valuations $\nu$ map variables to values or paths. Satisfaction $I, \nu \models L$ holds if $(\nu(e_1),\ldots,\nu(e_n)) \in I(R)$ for predicates or $\nu(e_1) = \nu(e_2)$ for equations.
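The definitions above can be made concrete with a minimal Python sketch. It is not taken from Aamer et al. (2022); the class names (`Packed`, `Atom`, `AtomVar`, `PathVar`, `Pack`, `Concat`) and the functions `evaluate` and `equation_holds` are hypothetical names chosen for illustration. Paths are modeled as tuples, and a packed value wraps a whole path (a single value being a path of length one).

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Packed:
    """A packed value <p>: a path wrapped so that it counts as one value."""
    content: Tuple["Value", ...]


Value = Union[str, Packed]   # an atomic symbol or a packed value
Path = Tuple[Value, ...]     # finite concatenation of values; () is the empty path


# Path expressions e ::= a | @x | $x | <e> | e1 . e2 (illustrative AST)
@dataclass(frozen=True)
class Atom:
    symbol: str


@dataclass(frozen=True)
class AtomVar:
    name: str


@dataclass(frozen=True)
class PathVar:
    name: str


@dataclass(frozen=True)
class Pack:
    inner: "Expr"


@dataclass(frozen=True)
class Concat:
    left: "Expr"
    right: "Expr"


Expr = Union[Atom, AtomVar, PathVar, Pack, Concat]


def evaluate(e: Expr, nu: Dict[str, object]) -> Path:
    """Apply a valuation nu (variable name -> atomic symbol or path) to e."""
    if isinstance(e, Atom):
        return (e.symbol,)
    if isinstance(e, AtomVar):
        return (nu[e.name],)            # atomic variable: one atomic symbol
    if isinstance(e, PathVar):
        return tuple(nu[e.name])        # path variable: a whole path
    if isinstance(e, Pack):
        return (Packed(evaluate(e.inner, nu)),)
    if isinstance(e, Concat):
        return evaluate(e.left, nu) + evaluate(e.right, nu)
    raise TypeError(f"not a path expression: {e!r}")


def equation_holds(e1: Expr, e2: Expr, nu: Dict[str, object]) -> bool:
    """An equation e1 = e2 is satisfied when both sides denote the same path."""
    return evaluate(e1, nu) == evaluate(e2, nu)
```

Under this sketch, evaluating the expression `$x · <@y · a>` with `$x` bound to the path `a·b` and `@y` bound to `c` yields a path of length three whose last value is the packed path `c·a`.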
2. Orthogonal Language Features
Six language features are identified as orthogonal axes along which the expressivity and structural complexity of Sequence Datalog fragments can be analyzed:
| Feature | Notation | Description |
|---|---|---|
| Negation | N | Use of stratified negation in rule bodies |
| Recursion | R | Recursive (cyclic) rule dependencies |
| Intermediate Predicates | I | More than one IDB predicate defined |
| Arity | A | Use of predicates with arity $> 1$ |
| Equations | E | Equality/inequality of path expressions |
| Packing | P | Use of the packing operator $\langle \cdot \rangle$ |
Negation and recursion correspond to classical Datalog extensions. Intermediate predicates describe programs with non-flat structure. Arity allows for non-unary relations. Equations permit direct matching or constraints on sequences. Packing enables subsequences to be treated as atomic units.
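As a rough illustration of how the six axes could be read off a program, the following Python sketch computes the set of feature letters a rule set uses. The rule representation (`Rule`, `Literal`, the `uses_packing` flag) is hypothetical and much simpler than real syntax. Counting only negated predicate literals toward N, and leaving negated equations under E, is an assumption of this sketch rather than a statement about the source's definitions.

```python
from typing import Dict, List, NamedTuple, Set


class Literal(NamedTuple):
    kind: str       # "pred" for a predicate literal, "eq" for an equation
    name: str       # predicate name ("" for equations)
    arity: int      # predicate arity (0 for equations)
    negated: bool


class Rule(NamedTuple):
    head: str
    head_arity: int
    body: List[Literal]
    uses_packing: bool   # True if a packing operator occurs anywhere in the rule


def features_used(rules: List[Rule]) -> Set[str]:
    idb = {r.head for r in rules}        # predicates defined by some rule
    literals = [lit for r in rules for lit in r.body]
    feats: Set[str] = set()

    if any(lit.kind == "pred" and lit.negated for lit in literals):
        feats.add("N")                   # assumption: only negated predicates count
    if len(idb) > 1:
        feats.add("I")
    if any(r.head_arity > 1 for r in rules) or any(
        lit.kind == "pred" and lit.arity > 1 for lit in literals
    ):
        feats.add("A")
    if any(lit.kind == "eq" for lit in literals):
        feats.add("E")
    if any(r.uses_packing for r in rules):
        feats.add("P")

    # Recursion: some IDB predicate depends on itself, directly or indirectly.
    deps: Dict[str, Set[str]] = {p: set() for p in idb}
    for r in rules:
        for lit in r.body:
            if lit.kind == "pred" and lit.name in idb:
                deps[r.head].add(lit.name)

    def reaches(start: str, target: str, seen: Set[str]) -> bool:
        for nxt in deps[start]:
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                if reaches(nxt, target, seen):
                    return True
        return False

    if any(reaches(p, p, set()) for p in idb):
        feats.add("R")
    return feats
```

For example, a two-stratum program that defines `T` from `Q` and then `S` from `R` and the negation of `T` is classified as using N and I under these conventions.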
3. Expressiveness Hierarchy: Redundancy and Primitivity Results
The analysis in Aamer et al. (2022) establishes which features are strictly required ("primitive") and which are always or sometimes redundant in the presence of others:
- Arity (A) is always redundant: Any use of predicates of arity greater than 1 can be simulated via unary predicates, packing, and a fresh separator symbol, with supporting equations to handle parsing of packed values.
- Packing (P) is always redundant: Packed values can always be simulated via concatenation with delimiters and, in recursive contexts, output-undoubling constructions from J-Logic, relying on arity (already known redundant).
- Equations (E) are redundant given both Negation (N) and Intermediate predicates (I): All uses of equality/inequality can be encoded with auxiliary predicates and stratified negation.
- Intermediate predicates (I) are redundant absent both N and R: In positive, non-recursive, flat programs, all predicates can be inlined into the heads of rules via equations or packing.
- Negation (N) is primitive: It fundamentally enables non-monotone queries, such as set difference, which are not expressible by positive programs alone.
- Recursion (R) is primitive: Only recursive programs can express queries whose output length grows super-linearly in the length of the input.
On flat unary instances with monadic schemas, two fragments may be equivalent in expressive power, or one may strictly dominate the other. The expressiveness lattice, taking the redundancy results into account, is ordered by strict containment as follows:
- Bottom: the purely positive, nonrecursive, monadic fragment
- Intermediate levels: fragments that add some combination of equations (E), intermediate predicates (I), negation (N), and recursion (R)
- Top: the fragment with all features (the most expressive)
Packing (P) and arity (A) do not increase expressive power, and fragments distinguished only by these features collapse together. The top of the lattice is the fragment permitting all features; otherwise, the key "levers" are negation and recursion.
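The four redundancy statements can be applied mechanically to a fragment, written as a set of feature letters. The Python sketch below does only that, taking the statements at face value; the function name `drop_redundant` is hypothetical. It does not claim that two fragments are equivalent exactly when they reduce to the same set.

```python
from typing import FrozenSet, Iterable


def drop_redundant(fragment: Iterable[str]) -> FrozenSet[str]:
    """Remove features that the stated redundancy results make unnecessary."""
    f = set(fragment)

    # Arity and packing are stated to be always redundant.
    f -= {"A", "P"}

    # Equations are redundant given both negation and intermediate predicates.
    if "N" in f and "I" in f:
        f.discard("E")

    # Intermediate predicates are redundant absent both negation and recursion.
    if "N" not in f and "R" not in f:
        f.discard("I")

    return frozenset(f)
```

For instance, `drop_redundant("EINAP")` keeps only I and N, while `drop_redundant("EI")` keeps only E because neither negation nor recursion is present.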
5. Illustrative Patterns and Canonical Examples
Several prototypical queries illustrate the use and necessity of Sequence Datalog features:
- NFA Acceptance (Recursion): deciding whether an input string is accepted by a nondeterministic finite automaton, expressed with recursive rules.
- All-a's Test (Equation): selecting the paths that consist only of the symbol a, expressed with an equation.
- Subsequence Packing (Packing Operator): wrapping a subsequence of a path as a single packed value.
- Reversal Without Arity: Encoding a pair of paths as a single bracketed value and using equations allows reversal constructions without genuinely binary predicates.
These examples underpin the theoretical results by showing which features are exploited and how simulating them with weaker fragments fails.
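Two of these patterns can be imitated in a few lines of Python. This is a sketch under stated assumptions, not the rules given in the source. The all-a's test relies on the word-equation fact that a path x satisfies x·a = a·x exactly when x consists only of a's. The rule shown in the comment is one assumed formulation. The NFA test mimics bottom-up evaluation of a recursive rule by iterating to a fixpoint. The function names and the transition format `(state, symbol, next_state)` are hypothetical.

```python
from typing import Iterable, Set, Tuple

Path = Tuple[str, ...]


def all_as(paths: Iterable[Path], a: str = "a") -> Set[Path]:
    # Assumed rule:  S($x) <- R($x), $x . a = a . $x
    return {x for x in paths if x + (a,) == (a,) + x}


def nfa_accepts(
    word: Iterable[str],
    transitions: Set[Tuple[str, str, str]],
    start: str,
    finals: Set[str],
) -> bool:
    # Facts are (state, remaining suffix); new facts are derived until none appear,
    # which mirrors naive evaluation of a recursive rule.
    reached = {(start, tuple(word))}
    changed = True
    while changed:
        changed = False
        for state, rest in list(reached):
            if not rest:
                continue
            for src, symbol, dst in transitions:
                if src == state and symbol == rest[0]:
                    fact = (dst, rest[1:])
                    if fact not in reached:
                        reached.add(fact)
                        changed = True
    # Accept if some final state is reached with the whole input consumed.
    return any(state in finals and not rest for state, rest in reached)
```

With transitions `{("q0", "a", "q1"), ("q1", "b", "q0")}`, start state `q0`, and final state `q0`, the second function accepts `abab` and rejects `aba`.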
6. Design Implications and Practical Considerations
The expressiveness characterization directly informs the design of sequence query engines:
- Arity and packing can be omitted from implementations without loss of generality, as their apparent expressive contributions are always redundant.
- Equations are essential for concise expression of pattern tests but offer no gain if both negation and intermediate predicates are already present.
- Intermediate predicates can be excluded in positive, non-recursive settings (flat programs) but are indispensable with negation or recursion.
- Recursion and stratified negation are the primary sources of increased expressive power, fundamentally enlarging the class of queries that can be defined.
A plausible implication is that practical systems can stay tractable to implement by concentrating their support on recursion and stratified negation while minimizing or syntactically eliminating features such as arity and packing.
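One way to see why higher arity need not be a built-in feature: a tuple of paths can be stored as a single path whose values are the packed components. The Python sketch below shows that such an encoding is lossless, whereas plain concatenation is not. It uses packing alone for brevity and omits the separator symbol and equations that the simulation described earlier also involves. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple


@dataclass(frozen=True)
class Packed:
    """A packed value wrapping one component path."""
    content: tuple


def encode_tuple(components: Iterable[tuple]) -> tuple:
    # (p1, ..., pn)  ->  the single path <p1> . ... . <pn>
    return tuple(Packed(tuple(p)) for p in components)


def decode_tuple(path: tuple) -> tuple:
    # Inverse of encode_tuple: unwrap each packed component.
    return tuple(value.content for value in path)


def to_unary(relation: Set[tuple]) -> Set[tuple]:
    """Turn an n-ary relation over paths into a unary relation over paths."""
    return {encode_tuple(t) for t in relation}


def from_unary(paths: Set[tuple]) -> Set[tuple]:
    return {decode_tuple(p) for p in paths}


def flatten(components: Iterable[tuple]) -> tuple:
    # Plain concatenation, shown only for contrast: component boundaries are lost.
    return tuple(symbol for p in components for symbol in p)
```

The pairs `(a·b, ε)` and `(a, b)` flatten to the same path `a·b`, but their packed encodings differ, so the binary relation can be recovered from the unary one.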
7. Research Significance and Outlook
The systematic analysis of Sequence Datalog features provides a comprehensive map of all potential language fragments, their expressive capabilities, and mutual simulations or strict separations. The results clarify longstanding questions about the necessity of various logic-programming extensions for sequence data and identify where expressive "jumps" actually occur. For sequence-centric applications, including process mining, information extraction, and modern graph/path query tasks, these insights enable meaningful language design choices and guide the implementation of efficient query engines attuned to the true requirements of their domains (Aamer et al., 2022).
References
1. Aamer et al., 2022.