Maxime Coste
f07375fb27
Regex: remove dead code
2017-11-01 14:05:15 +08:00
Maxime Coste
2c2073b417
Regex: Tweak struct layouts of ParsedRegex data
2017-11-01 14:05:15 +08:00
Maxime Coste
bbd7e604dc
Regex: Remove "Ast" from names in the ParsedRegex
...
It does not add much value, and makes names longer.
2017-11-01 14:05:15 +08:00
Maxime Coste
18a02ccacd
Regex: Optimize parsing and compilation
...
AstNodes are now POD, stored in a single vector, accessed through
their index. The children list is implicit, with nodes storing only
the node index at which their child graph ends.
That makes reverse iteration slower, but that is only used for
reverse-matching regexes, which are uncommon. In the general case compilation
is now faster.
2017-11-01 14:05:15 +08:00
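The flat node storage described in 18a02ccacd can be pictured with the following minimal sketch. The `Node` layout, the `children_end` field name and the helper functions are illustrative assumptions, not the actual ParsedRegex definitions. It only shows how an implicit children list can be walked when each node records where its subtree ends.

```cpp
#include <cstdint>
#include <vector>

using NodeIndex = uint16_t;

// Hypothetical POD node: the names are illustrative, not taken from the commit.
struct Node
{
    char      value;        // payload, e.g. a literal or an operator tag
    NodeIndex children_end; // index one past the last node of this subtree
};

// Visit the direct children of `parent` in forward order.
// The first child sits right after its parent, and each child's subtree
// ends at its own children_end, which is where the next sibling starts.
template<typename Func>
void for_each_child(const std::vector<Node>& nodes, NodeIndex parent, Func&& func)
{
    const NodeIndex end = nodes[parent].children_end;
    for (NodeIndex child = parent + 1; child != end; child = nodes[child].children_end)
        func(child);
}

// Convenience wrapper collecting the direct children indices.
std::vector<NodeIndex> children_of(const std::vector<Node>& nodes, NodeIndex parent)
{
    std::vector<NodeIndex> result;
    for_each_child(nodes, parent, [&](NodeIndex index) { result.push_back(index); });
    return result;
}
```

With this layout forward iteration is a simple hop from sibling to sibling, whereas finding the previous sibling requires walking forward from the parent, which is consistent with the commit's remark that reverse iteration becomes slower.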
Maxime Coste
aea2de885d
Regex: minor cleanup of the regex parsing code
2017-11-01 14:05:15 +08:00
Maxime Coste
6e0275e550
Regex: small code cleanup in the Save compilation code
2017-11-01 14:05:15 +08:00
Maxime Coste
9e15207d2a
Regex: put the other char boolean inside the general start char map
2017-11-01 14:05:15 +08:00
Maxime Coste
7c3bc48627
Fix ConstexprVector::resize
2017-11-01 14:05:15 +08:00
Maxime Coste
60e32d73ff
Regex: Fix handling of all unicode codepoints as start chars
2017-11-01 14:05:15 +08:00
Maxime Coste
df2bf9601c
Regex: fix wrong fallthrough in dump_regex
2017-11-01 14:05:15 +08:00
Maxime Coste
e9e9a08e7b
Regex: refactor handling of Saves slightly, do not create them until really needed
2017-11-01 14:05:15 +08:00
Maxime Coste
d9b4076e3c
Regex: Go back to instruction based search of next start
...
The previous method, which was a bit faster in the general use case,
can hit some cases where we get quadratic behaviour and very slow
matching.
By using an instruction, we can guarantee our complexity of O(N*M)
as we will never have more than N threads (N being the instruction
count) and we run the threads once per codepoint in the subject
string.
That slows down the general case slightly, but ensures we don't have
pathological cases.
This new version is much faster than the previous instruction based
search because it does not use a plain `.*` searcher, but a specific,
smarter instruction specialized for finding the next start if we are
in the correct conditions.
2017-11-01 14:05:15 +08:00
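The complexity argument in d9b4076e3c can be illustrated with a small thread-list VM sketch. Everything here (the `Op` and `Inst` types, `add_thread`, `search`) is a hypothetical simplification, and the search prefix is a plain "skip any char and retry" loop rather than the specialized next-start instruction the commit introduces. The point is only that each instruction is scheduled at most once per step, so a step never runs more than N threads.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical instruction set, much smaller than a real regex VM.
enum class Op { Literal, AnyChar, Split, Jump, Match };

struct Inst
{
    Op op;
    char ch;       // character to match, for Literal
    std::size_t x; // Jump target, or preferred Split target
    std::size_t y; // fallback Split target
};

// Schedule instruction `pc`, following Split and Jump eagerly.
// `seen` ensures an instruction is scheduled at most once per step,
// so `threads` never holds more than prog.size() entries.
void add_thread(const std::vector<Inst>& prog, std::size_t pc,
                std::vector<std::size_t>& threads, std::vector<bool>& seen)
{
    if (seen[pc])
        return;
    seen[pc] = true;
    switch (prog[pc].op)
    {
        case Op::Jump:
            add_thread(prog, prog[pc].x, threads, seen);
            break;
        case Op::Split:
            add_thread(prog, prog[pc].x, threads, seen);
            add_thread(prog, prog[pc].y, threads, seen);
            break;
        default:
            threads.push_back(pc);
            break;
    }
}

// Returns true if the program matches somewhere in `subject`.
// The thread list is run once per character: O(N*M) work overall.
bool search(const std::vector<Inst>& prog, const std::string& subject)
{
    std::vector<std::size_t> current, next;
    std::vector<bool> seen(prog.size(), false);
    add_thread(prog, 0, current, seen);

    for (std::size_t pos = 0; pos <= subject.size(); ++pos)
    {
        seen.assign(prog.size(), false);
        next.clear();
        for (std::size_t pc : current)
        {
            const Inst& inst = prog[pc];
            if (inst.op == Op::Match)
                return true;
            if (pos == subject.size())
                continue; // no character left to consume
            const bool consumes = inst.op == Op::AnyChar ||
                                  (inst.op == Op::Literal && inst.ch == subject[pos]);
            if (consumes)
                add_thread(prog, pc + 1, next, seen);
        }
        current.swap(next);
    }
    return false;
}
```

In this sketch the search loop is part of the program itself (a Split that either tries the pattern or consumes one character and jumps back), so restarting at every position costs no extra threads beyond the N bound.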
Maxime Coste
3f627058b0
Regex: add support for \0, \cX, \xXX and \uXXXX escapes
2017-11-01 14:05:15 +08:00
Maxime Coste
c423b47109
Regex: compute if codepoints outside of the start chars map can start
2017-11-01 14:05:15 +08:00
Maxime Coste
2c6c0be0c1
Regex: abort compilation as soon as we hit the instruction count limit
2017-11-01 14:05:15 +08:00
Maxime Coste
d44e160aa7
Regex: add a unit test for why lookaheads don't count for start chars anymore
2017-11-01 14:05:15 +08:00
Maxime Coste
87eec79d07
Regex: comment the mutables in CompiledRegex::Instruction and fix their init
2017-11-01 14:05:14 +08:00
Maxime Coste
8b2297f5ca
Regex: Introduce a Regex memory domain to track usage separately
2017-11-01 14:05:14 +08:00
Maxime Coste
9ec175f2f8
Regex: use binary search for character class range checks
2017-11-01 14:05:14 +08:00
Maxime Coste
6e65589a34
Regex: compute start chars from matchers, do not compute it from lookarounds
...
Computing potential start characters from lookarounds is more complex
than expected, and not worth the complexity.
2017-11-01 14:05:14 +08:00
Maxime Coste
621b0d3ab8
Regex: remove the need for a processed inst vector
...
Identify each step with a counter, and check if the instruction
was already processed this step. This makes the matching faster,
by removing the need to maintain a vector of instructions executed
this step.
2017-11-01 14:05:14 +08:00
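The step counter idea from 621b0d3ab8 might look like the sketch below. The `Inst::last_step` field, the `StepTracker` type and the 16-bit counter width are assumptions made for illustration; the commit only states that each step is identified by a counter and that instructions are checked against it.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical instruction carrying the id of the last step that processed it.
struct Inst
{
    uint16_t last_step = 0; // 0 means "not processed in any live step"
};

struct StepTracker
{
    uint16_t current_step = 0;

    // Start a new step. No per-instruction reset is needed, except when
    // the counter wraps around and old stamps could be mistaken for new ones.
    void next_step(std::vector<Inst>& insts)
    {
        if (++current_step == 0)
        {
            for (auto& inst : insts)
                inst.last_step = 0;
            current_step = 1;
        }
    }

    // Returns true the first time an instruction is seen in the current
    // step, false if it was already processed during this step.
    bool mark_processed(Inst& inst) const
    {
        if (inst.last_step == current_step)
            return false;
        inst.last_step = current_step;
        return true;
    }
};
```

Compared with keeping a vector of processed instructions, the check is a single integer comparison and starting a new step is a single increment in the common case.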
Maxime Coste
cfc52d7e6a
Regex: use intrusive linked list for the free saves instead of a Vector
2017-11-01 14:05:14 +08:00
Maxime Coste
df16fea82d
Regex: rename "flags" with the more common "modifiers"
2017-11-01 14:05:14 +08:00
Maxime Coste
52d443f764
Regex: Correctly handle ignore case mode for start chars computation
2017-11-01 14:05:14 +08:00
Maxime Coste
b8495f0953
Regex: Rework parsing, treat lookarounds as assertions, and flags separately
2017-11-01 14:05:14 +08:00
Maxime Coste
b0233262b8
Regex: Limit programs to std::numeric_limits<uint16_t>::max() instructions
2017-11-01 14:05:14 +08:00
Maxime Coste
8c8dcb3a84
Regex: Fix reverse searching behaviour, again
2017-11-01 14:05:14 +08:00
Maxime Coste
9753bcd0ad
Regex: limit explicit quantifiers value (to 1000 for now)
...
Fixes #1628
2017-11-01 14:05:14 +08:00
Maxime Coste
2b97e4e124
Regex: Fix handling of ^ and $ in backward matching mode
2017-11-01 14:05:14 +08:00
Maxime Coste
3c999aba37
Regex: Only reset processed and scheduled flags on relevant instructions
...
On big regexes, resetting all those flags on all instructions for each
character can become the dominant operation. Track the indices of the
instructions actually processed (the scheduled ones are already tracked
in the next_threads vector), and only reset these.
2017-11-01 14:05:14 +08:00
Maxime Coste
5bf4be645a
Regex: Fix support for ignore case in lookarounds
2017-11-01 14:05:14 +08:00
Maxime Coste
80f6caee81
Regex: move try/catch blocks inside boost specific code
2017-11-01 14:05:14 +08:00
Maxime Coste
dd9e43e6f9
Regex: small code cleanup
2017-11-01 14:05:14 +08:00
Maxime Coste
23b3a221eb
Regex: support more than two children in alternations
...
Avoid deep nested alternations, parse them flattened.
2017-11-01 14:05:14 +08:00
Maxime Coste
fb5243f710
Regex: print instruction index in dump_regex
2017-11-01 14:05:14 +08:00
Maxime Coste
c8966ca701
Regex: Assert that the regex direction matches the vm direction
2017-11-01 14:05:14 +08:00
Maxime Coste
74ed102cab
Regex: Tweak definition of character class and control escape tables
2017-11-01 14:05:14 +08:00
Maxime Coste
df73b71dfc
Regex: fix lookarounds handling when computing starting chars
2017-11-01 14:05:14 +08:00
Maxime Coste
1c95074657
Make use of custom regex backward searching support for reverse search
2017-11-01 14:05:14 +08:00
Maxime Coste
785cd34b4b
Regex: Make boost checking disableable at compile time
2017-11-01 14:05:14 +08:00
Maxime Coste
065bbc8f59
Regex: switch to custom impl, use boost for checking
2017-11-01 14:05:14 +08:00
Maxime Coste
9305fa1369
Regex: Fix lookaround use in moon.kak
...
(?=[A-Z]\w*) is strictly the same as (?=[A-Z]) as \w* will always
at least match an empty string.
2017-11-01 14:05:14 +08:00
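The equivalence claimed in 9305fa1369 can be checked with a small sketch. It uses `std::regex` (ECMAScript grammar) as a stand-in engine, not the custom implementation this log is about, and the `all_matches` helper is a hypothetical name. The trailing `\w+` is only there so that each match has visible text to compare.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect the text of every non-overlapping match of `pattern` in `subject`.
std::vector<std::string> all_matches(const std::string& pattern, const std::string& subject)
{
    const std::regex re(pattern);
    std::vector<std::string> result;
    for (std::sregex_iterator it(subject.begin(), subject.end(), re), end; it != end; ++it)
        result.push_back(it->str());
    return result;
}
```

Calling `all_matches("(?=[A-Z]\\w*)\\w+", text)` and `all_matches("(?=[A-Z])\\w+", text)` on the same `text` should give identical results, since `\w*` can always match the empty string and therefore never makes the lookahead fail.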
Maxime Coste
cca730193c
Regex: Support any char and character classes in lookarounds
...
Lookarounds still need to be fixed size, but accept character classes
as well as plain literals.
2017-11-01 14:05:14 +08:00
Maxime Coste
b8cb65160a
Regex: use std::conditional instead of custom template class to choose Utf8It
2017-11-01 14:05:14 +08:00
Maxime Coste
db06acdfab
Regex: Fix computation of potential starts for lookaheads
2017-11-01 14:05:14 +08:00
Maxime Coste
34b1f1ccb6
Regex: detect when all characters can start and avoid allocating
2017-11-01 14:05:14 +08:00
Maxime Coste
ea85f79384
Regex: add elided braces to fix compilation on older gcc
2017-11-01 14:05:14 +08:00
Maxime Coste
bf3b50a543
Regex: Fix wrong size of character_class_escapes array
2017-11-01 14:05:14 +08:00
Maxime Coste
08ea68dc1f
Regex: Fix handling of match_prev_avail for boost regex
...
We were passing around iterators that were not allowed to
go before the begin iterator.
2017-11-01 14:05:14 +08:00
Maxime Coste
9ec376135b
Regex: Introduce RegexExecFlags::PrevAvailable
...
Rework assertion code as well.
2017-11-01 14:05:14 +08:00