During my search for a way to integrate inverse references efficiently (using tracking) into a text-driven regular expression engine, I found a document entitled Extending Finite Automata to efficiently combine regular expressions compatible with Perl that seems to address this problem. (The PDF can be found at your favorite science center).

This is what I understand of the proposed algorithm.

- It is based on Thompson's NFA algorithm.
- A list of substring sets is associated with each active state. Each set contains strings that match a particular pair of parentheses. There is a set per capture parenthesis.
- When a reverse reference is found, one letter of each word is removed from the set of substrings.
- The elements of the set whose first letter does not match are completely removed.
- When an element of the set is empty, the transition labeled
`one`

Actually it crosses.

If I'm not wrong, this algorithm would match `(a | ab) (bc | c) 1 2`

with the rope `abc ab bc`

, which obviously is incorrect. I think that this algorithm does not take into account that the sub-coincidences interact with each other.

In addition, it seems to be a polynomial algorithm. Although it is known that the coincidence of regular expressions with inverse references is NP-hard.

Did I miss something and the algorithm is correct? Or is the proposed algorithm really incorrect?

Is there a way to fix it without storing an exponential combination of sub-parches?