In the automata theory, a tagged deterministic finite automaton (TDFA) is an extension of deterministic finite automaton (DFA). In addition to solving the recognition problem for regular languages, TDFA is also capable of submatch extraction and parsing. While canonical DFA can find out if a string belongs to the language defined by a regular expression, TDFA can also extract substrings that match specific subexpressions. More generally, TDFA can identify positions in the input string that match tagged positions in a regular expression (tags are meta-symbols similar to capturing parentheses, but without the pairing requirement). == History == TDFA were first described by Ville Laurikari in 2000. Prior to that it was unknown whether it is possible to perform submatch extraction in one pass on a deterministic finite-state automaton, so this paper was an important advancement. Laurikari described TDFA construction and gave a proof that the determinization process terminates, however the algorithm did not handle disambiguation correctly. In 2007 Chris Kuklewicz implemented TDFA in a Haskell library Regex-TDFA with POSIX longest-match semantics. Kuklewicz gave an informal description of the algorithm and answered the principal question whether TDFA are capable of POSIX longest-match disambiguation, which was doubted by other researchers. In 2017 Ulya Trafimovich described TDFA with one-symbol lookahead. The use of a lookahead symbol reduces the number of registers and register operations in a TDFA, which makes it faster and often smaller than Laurikari TDFA. Trafimovich called TDFA variants with and without lookahead TDFA(1) and TDFA(0) by analogy with LR parsers LR(1) and LR(0). The algorithm was implemented in the open-source lexer generator RE2C. Trafimovich formalized Kuklewicz disambiguation algorithm. In 2018 Angelo Borsotti worked on an experimental Java implementation of TDFA; it was published later in 2021. In 2019 Borsotti and Trafimovich adapted POSIX disambiguation algorithm by Okui and Suzuki to TDFA. They gave a formal proof of correctness of the new algorithm and showed that it is faster than Kuklewicz algorithm in practice. In 2020 Trafimovich published an article about TDFA implementation in RE2C. In 2022 Borsotti and Trafimovich published a paper with a detailed description of TDFA construction. The paper incorporated their past research and presented multi-pass TDFA that are better suited to just-in-time determinization. They also compared TDFA against other algorithms and provided benchmarks. == Formal definition == TDFA have the same basic structure as ordinary DFA: a finite set of states linked by transitions. In addition to that, TDFA have a fixed set of registers that hold tag values, and register operations on transitions that set or copy register values. The values may be scalar offsets, or offset lists for tags that match repeatedly (the latter can be represented efficiently using a trie structure). There is no one-to-one mapping between tags in a regular expression and registers in a TDFA: a single tag may need many registers, and the same register may hold values of different tags. The following definition is according to Trafimovich and Borsotti. The original definition by Laurikari is slightly different. A tagged deterministic finite automaton F {\displaystyle F} is a tuple ( Σ , T , S , S f , s 0 , R , R f , δ , φ ) {\displaystyle (\Sigma ,T,S,S_{f},s_{0},R,R_{f},\delta ,\varphi )} , where: Σ {\displaystyle \Sigma } is a finite set of symbols (alphabet) T {\displaystyle T} is a finite set of tags S {\displaystyle S} is a finite set of states with initial state s 0 {\displaystyle s_{0}} and a subset of final states S f ⊆ S {\displaystyle S_{f}\subseteq S} R {\displaystyle R} is a finite set of registers with a subset of final registers R f {\displaystyle R_{f}} (one per tag) δ : S × Σ → S × O ∗ {\displaystyle \delta :S\times \Sigma \rightarrow S\times O^{}} is a transition function φ : S f → O ∗ {\displaystyle \varphi :S_{f}\rightarrow O^{}} is a final function, where O {\displaystyle O} is a set of register operations of the following types: set register i {\displaystyle i} to nil or to the current position: i ← v {\displaystyle i\leftarrow v} , where v ∈ { n , p } {\displaystyle v\in \{\mathbf {n} ,\mathbf {p} \}} copy register j {\displaystyle j} to register i {\displaystyle i} : i ← j {\displaystyle i\leftarrow j} copy register j {\displaystyle j} to register i {\displaystyle i} and append history: i ← j ⋅ h {\displaystyle i\leftarrow j\cdot h} , where h {\displaystyle h} is a string over { n , p } {\displaystyle \{\mathbf {n} ,\mathbf {p} \}} === Example === Figure 0 shows an example TDFA for regular expression ( 1 a 2 ) ∗ 3 ( a | 4 b ) 5 b ∗ {\displaystyle (1a2)^{}3(a|4b)5b^{}} with alphabet Σ = { a , b } {\displaystyle \Sigma =\{a,b\}} and a set of tags T = { 1 , 2 , 3 , 4 , 5 } {\displaystyle T=\{1,2,3,4,5\}} that matches strings of the form a … a b … b {\displaystyle a\dots ab\dots b} with at least one symbol. TDFA has four states S = { 0 , 1 , 2 , 3 } {\displaystyle S=\{0,1,2,3\}} three of which are final S f = { 1 , 2 , 3 } {\displaystyle S_{f}=\{1,2,3\}} . The set of registers is R = { r 1 , r 2 , r 3 , r 4 , r 5 } {\displaystyle R=\{r_{1},r_{2},r_{3},r_{4},r_{5}\}} with a subset of final registers R f = { r 1 , r 2 , r 3 , r 4 , r 5 } {\displaystyle R_{f}=\{r_{1},r_{2},r_{3},r_{4},r_{5}\}} where register r i {\displaystyle r_{i}} corresponds to i {\displaystyle i} -th tag. Transitions have operations defined by the δ {\displaystyle \delta } function, and final states have operations defined by the φ {\displaystyle \varphi } function (marked with wide-tipped arrow). For example, to match string a a b {\displaystyle aab} , one starts in state 0, matches the first a {\displaystyle a} and moves to state 1 (setting registers r 1 , r 2 {\displaystyle r_{1},r_{2}} to undefined and r 3 {\displaystyle r_{3}} to the current position 0), matches the second a {\displaystyle a} and loops to state 1 (register values are now r 1 = 0 , r 2 = r 3 = 1 {\displaystyle r_{1}=0,r_{2}=r_{3}=1} ), matches b {\displaystyle b} and moves to state 2 (register values are now r 1 = 1 , r 2 = r 3 = r 4 = 2 {\displaystyle r_{1}=1,r_{2}=r_{3}=r_{4}=2} ), executes the final operations in state 2 (register values are now r 1 = 1 , r 2 = r 3 = r 4 = 2 , r 5 = 3 {\displaystyle r_{1}=1,r_{2}=r_{3}=r_{4}=2,r_{5}=3} ) and finally exits TDFA. == Complexity == Canonical DFA solve the recognition problem in linear time. The same holds for TDFA, since the number of registers and register operations is fixed and depends only on the regular expression, but not on the length of input. The overhead on submatch extraction depends on tag density in a regular expression and nondeterminism degree of each tag (the maximum number of registers needed to track all possible values of the tag in a single TDFA state). On one extreme, if there are no tags, a TDFA is identical to a canonical DFA. On the other extreme, if every subexpression is tagged, a TDFA effectively performs full parsing and has many operations on every transition. In practice for real-world regular expressions with a few submatch groups the overhead is negligible compared to matching with canonical DFA. == TDFA construction == TDFA construction is performed in a few steps. First, a regular expression is converted to a tagged nondeterministic finite automaton (TNFA). Second, a TNFA is converted to a TDFA using a determinization procedure; this step also includes disambiguation that resolves conflicts between ambiguous TNFA paths. After that, a TDFA can optionally go through a number of optimizations that reduce the number of registers and operations, including minimization that reduces the number of states. Algorithms for all steps of TDFA construction with pseudocode are given in the paper by Borsotti and Trafimovich. This section explains TDFA construction on the example of a regular expression a ∗ t b ∗ | a b {\displaystyle a^{}tb^{}|ab} , where t {\displaystyle t} is a tag and { a , b } {\displaystyle \{a,b\}} are alphabet symbols. === Tagged NFA === TNFA is a nondeterministic finite automaton with tagged ε-transitions. It was first described by Laurikari, although similar constructions were known much earlier as Mealy machines and nondeterministic finite-state transducers. TNFA construction is very similar to Thompson's construction: it mirrors the structure of a regular expression. Importantly, TNFA preserves ambiguity in a regular expression: if it is possible to match a string in two different ways, then TNFA for this regular expression has two different accepting paths for this string. TNFA definition by Borsotti and Trafimovich differs from the original one by Laurikari in that TNFA can have negative tags on transitions: they are needed to make the absence of match explicit in cases when there is a bypass for a tagged transition. Figure 1 shows TNFA for the example regu
Read more →