Maxime Coste
d652ec9ce1
Cleanup regex lookarounds implementation and reject incompatible regex
...
Fixes #2487
2018-10-10 22:47:59 +11:00
Maxime Coste
cde0c51cd6
Tweak comment to make it less ambiguous
2018-07-08 16:58:19 +10:00
Olivier Perret
67655de947
Use a dedicated vm op for dot when match-newline is false
2018-06-24 12:41:50 +02:00
Olivier Perret
b5ee1db1c4
Use bit-flags for storing regex options
2018-06-24 12:41:50 +02:00
Olivier Perret
8edef8b3f1
Add support for regex flag to toggle dot-matches-newline
2018-06-24 12:41:50 +02:00
Maxime Coste
1fb53ca712
Fix wrong use of constexpr
2018-04-30 07:41:31 +10:00
Maxime Coste
1e8026f143
Regex: Use only 128 characters in start desc and encode others as 0
...
Using 257 was using lots of memory for no good reason, as codepoints
> 127 are not common enough to be treated specially.
2018-04-29 19:58:18 +10:00
Maxime Coste
a1b8864c77
Merge remote-tracking branch 'lenormf/regex-format-string' into HEAD
2018-04-28 09:29:57 +10:00
Maxime Coste
2b9ec411d3
fix potential overflow in dump_regex
2018-04-28 09:29:15 +10:00
Frank LENORMAND
9bac04d35f
regex_impl: Fix a potential format string flaw
2018-04-27 09:24:22 +03:00
Maxime Coste
8438b33175
Add a debug regex command to dump regex instructions
2018-04-27 08:35:09 +10:00
Maxime Coste
f10eb9faa3
Use indices instead of pointers for saves/instruction in ThreadedRegexVM
...
Performance seems unaffected, but memory usage should be lowered
as the Thread struct is 4 bytes instead of 16.
2018-04-27 08:35:09 +10:00
Maxime Coste
71a1893a5e
Fix some trailing spaces and a tab that sneaked into the code base
2018-04-05 08:52:33 +10:00
Maxime Coste
b27d4afa8d
Regex: Only allow SyntaxCharacter and - to be escaped in a character class
...
Letting any character be escaped is error-prone, as it looks like
\l could mean [:lower:] (as it used to with boost) when it only means
literal l.
Fix the haskell.kak file as well.
Fixes #1945
2018-03-20 04:57:47 +11:00
Maxime Coste
fb65fa60f8
Regex: take the full subject range as a parameter
...
To allow more general lookarounds outside of the actual search range,
pass a second range (the actual subject). This allows us to remove
various flags such as PrevAvailable or NotBeginOfSubject, which are
now easy to check from the subject range.
Fixes #1902
2018-03-05 05:48:10 +11:00
Maxime Coste
933ac4d3d5
Regex: Improve comments and constify some variables
...
Reword various comments to make some tricky parts of the regex
engine easier to understand.
2018-02-24 17:40:08 +11:00
Maxime Coste
3584e00d19
Regex: Use a template argument instead of a regular one for "forward"
...
forward (which controls whether we are compiling for forward or backward
matching) is always statically known, and compilation will first
compile forward, then backward (if needed), so by having separate
compiled functions we get rid of runtime branches.
2018-02-09 22:45:53 +11:00
Maxime Coste
aa9f7753e8
Regex: minor code cleanup
2018-02-09 22:19:56 +11:00
Maxime Coste
413f880e9e
Regex: Support forward and backward matching code in the same CompiledRegex
...
No need to have two separate regexes to handle forward and backward
matching, just passing RegexCompileFlags::Backward will add support
for backward matching to the regex. For backward only regex, pass
RegexCompileFlags::NoForward as well to disable generation of
forward matching code.
2017-12-01 19:57:02 +08:00
Maxime Coste
7bfb695c45
Regex: Do not allow private use codepoint literals
...
We use them to encode non-literals in lookarounds, so they can
trigger bugs.
Fixes #1737
2017-12-01 16:37:18 +08:00
Maxime Coste
65b057f261
Regex: rename StartChars to StartDesc
...
It only contains chars for now, but it's still more generally
describing where matches can start.
2017-12-01 14:46:18 +08:00
Maxime Coste
b91f43b031
Regex: optimize parsing a bit
2017-11-30 14:32:29 +08:00
Maxime Coste
c1f0efa3f4
Regex: smarter handling of start chars computation for character class
2017-11-30 14:19:41 +08:00
Maxime Coste
ae0911b533
Regex: Various small code tweaks
2017-11-28 01:03:54 +08:00
Maxime Coste
4598832ed5
Regex: optimize compilation by reserving data
2017-11-28 00:59:57 +08:00
Maxime Coste
a52da6fe34
Regex: Tweak is_ctype implementation style
2017-11-28 00:13:42 +08:00
Maxime Coste
8b40f57145
Regex: Replace generic 'Matchers' with specialized functionality
...
Introduce CharacterClass and CharacterType Regex Op, and optimize
their evaluation.
2017-11-25 18:14:15 +08:00
Maxime Coste
0d44cf9591
Regex: do not decode utf8 in accept calls as they always run on ascii
2017-11-25 18:13:27 +08:00
Maxime Coste
ffb639bf96
Regex: add unit test for #1693
2017-11-13 01:12:05 +08:00
fsub
0dd8a9ba93
Fix #1693 : typo in RegexParser::character_class()
2017-11-12 17:35:03 +01:00
Maxime Coste
f07375fb27
Regex: remove dead code
2017-11-01 14:05:15 +08:00
Maxime Coste
2c2073b417
Regex: Tweak struct layouts of ParsedRegex data
2017-11-01 14:05:15 +08:00
Maxime Coste
bbd7e604dc
Regex: Remove "Ast" from names in the ParsedRegex
...
It does not add much value, and makes names longer.
2017-11-01 14:05:15 +08:00
Maxime Coste
18a02ccacd
Regex: Optimize parsing and compilation
...
AstNodes are now POD, stored in a single vector, accessed through
their index. The children list is implicit, with nodes storing only
the node index at which their child graph ends.
That makes reverse iteration slower, but that is only used for reverse
matching regexes, which are uncommon. In the general case compilation
is now faster.
2017-11-01 14:05:15 +08:00
Maxime Coste
aea2de885d
Regex: minor cleanup of the regex parsing code
2017-11-01 14:05:15 +08:00
Maxime Coste
6e0275e550
Regex: small code cleanup in the Save compilation code
2017-11-01 14:05:15 +08:00
Maxime Coste
9e15207d2a
Regex: put the other char boolean inside the general start char map
2017-11-01 14:05:15 +08:00
Maxime Coste
60e32d73ff
Regex: Fix handling of all unicode codepoint as start chars
2017-11-01 14:05:15 +08:00
Maxime Coste
df2bf9601c
Regex: fix wrong fallthrough in dump_regex
2017-11-01 14:05:15 +08:00
Maxime Coste
d9b4076e3c
Regex: Go back to instruction based search of next start
...
The previous method, which was a bit faster in the general use case,
can hit some cases where we get quadratic behaviour and very slow
matching.
By using an instruction, we can guarantee our complexity of O(N*M)
as we will never have more than N threads (N being the instruction
count) and we run the threads once per codepoint in the subject
string.
That slows down the general case slightly, but ensures we don't have
pathological cases.
This new version is much faster than the previous instruction based
search because it does not use a plain `.*` searcher, but a specific,
smarter instruction specialized for finding the next start if we are
in the correct conditions.
2017-11-01 14:05:15 +08:00
Maxime Coste
3f627058b0
Regex: add support for \0, \cX, \xXX and \uXXXX escapes
2017-11-01 14:05:15 +08:00
Maxime Coste
c423b47109
Regex: compute if codepoints outside of the start chars map can start
2017-11-01 14:05:15 +08:00
Maxime Coste
2c6c0be0c1
Regex: abort compilation as soon as we hit the instruction count limit
2017-11-01 14:05:15 +08:00
Maxime Coste
d44e160aa7
Regex: add a unit test for why lookaheads don't count for start chars anymore
2017-11-01 14:05:15 +08:00
Maxime Coste
87eec79d07
Regex: comment the mutables in CompiledRegex::Instruction and fix their init
2017-11-01 14:05:14 +08:00
Maxime Coste
8b2297f5ca
Regex: Introduce a Regex memory domain to track usage separately
2017-11-01 14:05:14 +08:00
Maxime Coste
9ec175f2f8
Regex: use binary search for character class range checks
2017-11-01 14:05:14 +08:00
Maxime Coste
6e65589a34
Regex: compute start chars from matchers, do not compute it from lookarounds
...
Computing potential start characters from lookarounds is more complex
than expected, and not worth the complexity.
2017-11-01 14:05:14 +08:00
Maxime Coste
df16fea82d
Regex: rename "flags" with the more common "modifiers"
2017-11-01 14:05:14 +08:00
Maxime Coste
52d443f764
Regex: Correctly handle ignore case mode for start chars computation
2017-11-01 14:05:14 +08:00