Search For advanced patterns in text. The common standard syntax are perl5™, however its old and kludgy. More interesting is the recently proposed perl6™ regexp syntax, which I plan to use as default regexp syntax in AEditor. At the moment there is a fully funtional perl5 parser frontend, and a very limited perl6 frontend. It is easy to write other parsers when/if needed.
placeholder.
Table 1. Ruby and Perl5 syntax
| atom atom | expr sequence. |
| expr | expr | alternation. |
| atom? | greedy repeat 0..1. |
| atom* | greedy repeat 0..infinity. |
| atom+ | greedy repeat 1..infinity. |
| atom{min,} | greedy repeat min..infinity. |
| atom{min,max} | greedy repeat min..max. |
| atom{min} | greedy repeat min..min. |
| atom?? | non-greedy repeat 0..1. |
| atom*? | non-greedy repeat 0..infinity. |
| atom+? | non-greedy repeat 1..infinity. |
| atom{min,}? | non-greedy repeat min..infinity. |
| atom{min,max}? | non-greedy repeat min..max. |
| atom{min}? | non-greedy repeat min..min. |
| \1 .. \9 | backreference. |
| ( expr ) | group. |
| (?: expr ) | pure group. |
| (?= expr ) | positive look ahead. |
| (?! expr ) | negative look ahead. |
| (?# ... ) | posix comment. |
| (?i: expr ), (?-i: expr ) | option-ignorecase=ON. option-ignorecase=OFF. Besides i, there are also x=extended, and m=multiline. |
| (?i), (?-i) | alternative way to toggle options. ignorecase=ON. ignorecase=OFF. |
| [ ... ] | character class. |
| [^ ... ] | inverse character class. |
| . | match everything, except newline. |
| \d | match digit. |
| \D | match anything else than digit. |
| \s | match space. |
| \S | match anything else than space. |
| \w | match word. |
| \W | match anything else than word. |
| ^ | beginning of line. |
| $ | end of line. |
| \A | beginning of string. |
| \z | end of string. |
| \Z | end of string, eventually exclude newline. |
| \b | boundary between word-blank. |
| \B | boundary between word-word or blank-blank. |
Simple optimizations is used at the moment.
left2right maximization. If we just did depth-first-search then it would maximize from right2left, which is very inefficient. By adding some more logic, then maximization occurs from left 2 right (efficient).
pathend if below minimum.
Eventual future optimizations.
fastmaps.
convert 'a*b' patterns into 'b(?<:a*)'.
instead of cloning, then use copy-on-write (COW) strategy.
binary branch when maximizing.
C++ implementation.
Nested repeats, empty expressions may cause endless loops, unless we detect that we are dealing with an endless loops (early).
Also when dealing with Repeats, we must attempt to maximize the left-most repeats first. At the moment we are doing the exact opposite (yields the same result, but very ineffective). This problem arises in this example:
/a(.*)b(.*)c/.match("0a1b2b3c4").to_a
#=> ["a1b2b3c", "1b2", "3"]
A simple inefficient recursive algorithm. It first reaches the left-most repeat. From there it tries to scan the following expression 'b(.*)c'. Thus the right-most repeat will be maximized before it again can return to the left-most repeat. It will then try to maximize the right-most repeat.. again! At some point the left-most repeat will be maximized.
abc - a.bc a.b.c a.b..c a.b...c OK a.b....c a.b.....c a..bc - a...bc a...b.c OK a...b..c a...b...c
The above senario is what I call: right2left maximization.
Lets talk about how left2right maximization occurs. First we try maximize the left-most repeat, add keep track of possible matches.
ab a.b remember a..b a...b remember a....b a.....b a......b
For each of the remembered positions (in reverse, because its greedy), we will try to maximize the second repeat.
a...bc a...b.c OK a...b..c a...b...c a...b....c
The remaining remembered positions can be discarded, because we now have the left-most-longest match.
The left-most repeat which we are attempting to maximize, uses a greedy algorithm. The remaining repeats, uses a strategy where they stop when they reaches the LAST node.
/a.*b.*c.*d/.match("abbcdcdd")We found 2 matches:
ab.cd a.bcd
Lets pick the last match (a.bcd) and try maximizing rep2.
We found 1 match:
a.b..cd
Lets try maximizing rep3.
We found 1 match:
a.b..c.d
Thats our final result.
placeholder.
/(a.*b){2}/.match("0a1b2ba3b4").to_a
#=> ["a1b2ba3b", "a3b"]
ab a.b OK a..b a...b OK a....b a.....b a......b OK a.......b a........b END
Next
a......bab END a...bab a...ba.b OK a...ba..b a...ba...b a...ba....b END
We found the right match.
An expression such as /(){2,}/ will loop forever unless we stop it. In this case there are 3 properties we may look at.
Are maximum equal infinity?
Does it have zero width?
Are index > minimum?
That will break the endless loop, and output ["", ""].
Let consider /()*/, with the above rules it will output ["", nil] which is wrong, the nil is incorrect. Because its a greedy repeat we expect the repeat to run at least one time, so the output would be ["", ""]. A sligth refinement of the rules is necessary.
Are index > one?
When all these 4 rules are satisfied we must stop looping, otherwise we will overwrite the content of our subcaptures with empty strings which is not desired. For instance see the difference between my engine and Oniguruma, decide for yourself what output is best.
assert_regex(["aaa", "a"], "a(a|)*", "aaabbb", :oniguruma_output=>["aaa", ""])
As you can see Oniguruma doesn't preserve capture[1], because the endless loop detection bails out too late, thus causing the capture to be cleared.