122 Commits

Author SHA1 Message Date
Maxime Coste
1fb53ca712 Fix wrong use of constexpr 2018-04-30 07:41:31 +10:00
Maxime Coste
1e8026f143 Regex: Use only 128 characters in start desc and encode others as 0
Using 257 entries consumed a lot of memory for no good reason, as
codepoints > 127 are not common enough to be treated specially.
2018-04-29 19:58:18 +10:00
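A minimal sketch of the idea described in this commit message, not the code from the repository: a 128-entry table indexed by codepoint, with every codepoint above 127 folded into slot 0. The names `StartDescSketch`, `slot_for`, `allow` and `may_start` are hypothetical, and the table layout is an assumption.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical constants: 128 slots, slot 0 shared by all codepoints >= 128.
constexpr std::size_t start_desc_count = 128;
constexpr std::size_t start_desc_other = 0;

// Illustrative type only; the real structure may differ.
struct StartDescSketch
{
    std::array<bool, start_desc_count> can_start{};

    // Codepoints below 128 get their own slot, all others share slot 0
    // (which is therefore also shared with the NUL character).
    static std::size_t slot_for(std::uint32_t codepoint)
    {
        return codepoint < start_desc_count
            ? static_cast<std::size_t>(codepoint)
            : start_desc_other;
    }

    void allow(std::uint32_t codepoint) { can_start[slot_for(codepoint)] = true; }

    bool may_start(std::uint32_t codepoint) const
    {
        return can_start[slot_for(codepoint)];
    }
};
```

Under this layout, allowing a single non-ASCII codepoint makes every codepoint above 127 a possible start, which trades precision for a smaller table.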
Maxime Coste
a1b8864c77 Merge remote-tracking branch 'lenormf/regex-format-string' into HEAD 2018-04-28 09:29:57 +10:00
Maxime Coste
2b9ec411d3 fix potential overflow in dump_regex 2018-04-28 09:29:15 +10:00
Frank LENORMAND
9bac04d35f regex_impl: Fix a potential format string flaw 2018-04-27 09:24:22 +03:00
Maxime Coste
8438b33175 Add a debug regex command to dump regex instructions 2018-04-27 08:35:09 +10:00
Maxime Coste
f10eb9faa3 Use indices instead of pointers for saves/instruction in ThreadedRegexVM
Performance seems unaffected, but memory usage should be lowered
as the Thread struct is 4 bytes instead of 16.
2018-04-27 08:35:09 +10:00
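The size difference mentioned above can be illustrated with a small sketch. The struct and field names here (`PointerThread`, `IndexThread`, `Instruction`, `instruction_of`) are hypothetical and the 16-bit index widths are an assumption; the point is only that two 16-bit indices occupy 4 bytes where two pointers occupy 16 bytes on a typical 64-bit platform.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical instruction type, for illustration only.
struct Instruction { int op; };

// Pointer-based thread: two pointers, 16 bytes on a typical 64-bit target.
struct PointerThread
{
    const Instruction* inst;
    int* saves;
};

// Index-based thread: two 16-bit indices into external vectors.
struct IndexThread
{
    std::uint16_t inst;  // index into the program
    std::int16_t saves;  // index into a saves pool, -1 meaning "none"
};

static_assert(sizeof(IndexThread) == 4, "two 16-bit indices take 4 bytes");

// Resolving an index needs the owning container, unlike a raw pointer.
inline const Instruction& instruction_of(const std::vector<Instruction>& program,
                                         IndexThread thread)
{
    return program[thread.inst];
}
```

Indices also stay valid if the underlying vector reallocates, which raw pointers do not.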
Maxime Coste
71a1893a5e Fix some trailing spaces and a tab that sneaked into the code base 2018-04-05 08:52:33 +10:00
Maxime Coste
b27d4afa8d Regex: Only allow SyntaxCharacter and - to be escaped in a character class
Letting any character be escaped is error-prone, as it looks like
\l could mean [:lower:] (as it used to with boost) when it only means
literal l.

Fix the haskell.kak file as well.

Fixes #1945
2018-03-20 04:57:47 +11:00
Maxime Coste
fb65fa60f8 Regex: take the full subject range as a parameter
To allow more general look arounds out of the actual search range,
pass a second range (the actual subject). This allows us to remove
various flags such as PrevAvailable or NotBeginOfSubject, which are
now easy to check from the subject range.

Fixes #1902
2018-03-05 05:48:10 +11:00
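As a rough illustration of why a separate subject range can replace flags such as PrevAvailable, the sketch below checks a line-start condition directly against the subject bounds. `View` and `is_line_start` are hypothetical names, and this is not the engine's actual code.

```cpp
// Hypothetical half-open character range [begin, end).
struct View
{
    const char* begin;
    const char* end;
};

// With the whole subject available, "is there a previous character?" is a
// plain pointer comparison instead of a flag supplied by the caller.
// Assumes pos lies within [subject.begin, subject.end].
inline bool is_line_start(const char* pos, View subject)
{
    return pos == subject.begin || *(pos - 1) == '\n';
}
```

A search restricted to a sub-range of the subject can then still look at the characters just outside that sub-range.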
Maxime Coste
933ac4d3d5 Regex: Improve comments and constify some variables
Reword various comments to make some tricky parts of the regex
engine easier to understand.
2018-02-24 17:40:08 +11:00
Maxime Coste
3584e00d19 Regex: Use a template argument instead of a regular one for "forward"
forward (which controls whether we are compiling for forward or backward
matching) is always statically known, and compilation will first
compile forward, then backward (if needed), so by having separate
compiled functions we get rid of runtime branches.
2018-02-09 22:45:53 +11:00
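The technique can be sketched as follows. The function `emit_sequence` is a hypothetical stand-in for a compilation step, assuming that a backward program emits a literal's characters in reverse order. Because `forward` is a template parameter, each instantiation contains only one of the two paths.

```cpp
#include <string>

// Hypothetical compilation step: returns the order in which the characters
// of a literal would be emitted. `forward` is a compile-time constant, so
// the compiler can drop the untaken branch in each instantiation.
template<bool forward>
std::string emit_sequence(const std::string& literal)
{
    if (forward)
        return literal;
    return std::string(literal.rbegin(), literal.rend());
}
```

A caller would invoke `emit_sequence<true>(...)` first and `emit_sequence<false>(...)` only when backward matching is requested.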
Maxime Coste
aa9f7753e8 Regex: minor code cleanup 2018-02-09 22:19:56 +11:00
Maxime Coste
413f880e9e Regex: Support forward and backward matching code in the same CompiledRegex
No need to have two separate regexes to handle forward and backward
matching, just passing RegexCompileFlags::Backward will add support
for backward matching to the regex. For a backward-only regex, pass
RegexCompileFlags::NoForward as well to disable generation of
forward matching code.
2017-12-01 19:57:02 +08:00
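The flag combinations described above could be modelled like this. Only the flag names Backward and NoForward come from the commit message; the enum `CompileFlagsSketch`, its numeric values, and the helpers `has` and `directions_for` are assumptions made for illustration.

```cpp
// Hypothetical bit-flag enum; the numeric values are assumptions.
enum class CompileFlagsSketch : unsigned
{
    None      = 0,
    Backward  = 1 << 0,
    NoForward = 1 << 1,
};

constexpr CompileFlagsSketch operator|(CompileFlagsSketch a, CompileFlagsSketch b)
{
    return static_cast<CompileFlagsSketch>(static_cast<unsigned>(a) |
                                           static_cast<unsigned>(b));
}

constexpr bool has(CompileFlagsSketch flags, CompileFlagsSketch flag)
{
    return (static_cast<unsigned>(flags) & static_cast<unsigned>(flag)) != 0;
}

struct Directions
{
    bool forward;
    bool backward;
};

// Forward code is generated unless NoForward is set; backward code is
// generated only when Backward is set.
constexpr Directions directions_for(CompileFlagsSketch flags)
{
    return Directions{ !has(flags, CompileFlagsSketch::NoForward),
                       has(flags, CompileFlagsSketch::Backward) };
}
```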
Maxime Coste
7bfb695c45 Regex: Do not allow private use codepoint literals
We use them to encode non-literals in lookarounds, so they can
trigger bugs.

Fixes #1737
2017-12-01 16:37:18 +08:00
Maxime Coste
65b057f261 Regex: rename StartChars to StartDesc
It only contains chars for now, but it's still more generally
describing where matches can start.
2017-12-01 14:46:18 +08:00
Maxime Coste
b91f43b031 Regex: optimize parsing a bit 2017-11-30 14:32:29 +08:00
Maxime Coste
c1f0efa3f4 Regex: smarter handling of start chars computation for character class 2017-11-30 14:19:41 +08:00
Maxime Coste
ae0911b533 Regex: Various small code tweaks 2017-11-28 01:03:54 +08:00
Maxime Coste
4598832ed5 Regex: optimize compilation by reserving data 2017-11-28 00:59:57 +08:00
Maxime Coste
a52da6fe34 Regex: Tweak is_ctype implementation style 2017-11-28 00:13:42 +08:00
Maxime Coste
8b40f57145 Regex: Replace generic 'Matchers' with specialized functionality
Introduce CharacterClass and CharacterType Regex Op, and optimize
their evaluation.
2017-11-25 18:14:15 +08:00
Maxime Coste
0d44cf9591 Regex: do not decode utf8 in accept calls as they always run on ascii 2017-11-25 18:13:27 +08:00
Maxime Coste
ffb639bf96 Regex: add unit test for #1693 2017-11-13 01:12:05 +08:00
fsub
0dd8a9ba93 Fix #1693: typo in RegexParser::character_class() 2017-11-12 17:35:03 +01:00
Maxime Coste
f07375fb27 Regex: remove dead code 2017-11-01 14:05:15 +08:00
Maxime Coste
2c2073b417 Regex: Tweak struct layouts of ParsedRegex data 2017-11-01 14:05:15 +08:00
Maxime Coste
bbd7e604dc Regex: Remove "Ast" from names in the ParsedRegex
It does not add much value, and makes names longer.
2017-11-01 14:05:15 +08:00
Maxime Coste
18a02ccacd Regex: Optimize parsing and compilation
AstNodes are now POD, stored in a single vector, accessed through
their index. The children list is implicit, with nodes storing only
the node index at which their child graph ends.

That makes reverse iteration slower, but that is only used for reverse
matching regex, which are uncommon. In the general case compilation
is now faster.
2017-11-01 14:05:15 +08:00
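A minimal sketch of the flat layout described in this commit message, with hypothetical names (`Node`, `children_end`, `for_each_child`) and an assumed pre-order storage: each node records the index one past its last descendant, so the direct children of a node can be walked forward without a stored child list.

```cpp
#include <cstdint>
#include <vector>

using NodeIndex = std::uint16_t;

// Plain-data node; `op` is a placeholder for the real node payload.
struct Node
{
    char op;
    NodeIndex children_end;  // index one past this node's last descendant
};

// Nodes are assumed stored in pre-order, so the subtree of node i occupies
// [i, nodes[i].children_end). The first child is i + 1 and each next
// sibling starts where the previous child's subtree ends.
template<typename Func>
void for_each_child(const std::vector<Node>& nodes, NodeIndex parent, Func&& func)
{
    const NodeIndex end = nodes[parent].children_end;
    for (NodeIndex child = static_cast<NodeIndex>(parent + 1); child != end;
         child = nodes[child].children_end)
        func(child);
}
```

Walking children in reverse has no such shortcut, since a node does not record where its previous sibling begins, which matches the trade-off the message mentions.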
Maxime Coste
aea2de885d Regex: minor cleanup of the regex parsing code 2017-11-01 14:05:15 +08:00
Maxime Coste
6e0275e550 Regex: small code cleanup in the Save compilation code 2017-11-01 14:05:15 +08:00
Maxime Coste
9e15207d2a Regex: put the other char boolean inside the general start char map 2017-11-01 14:05:15 +08:00
Maxime Coste
60e32d73ff Regex: Fix handling of all Unicode codepoints as start chars 2017-11-01 14:05:15 +08:00
Maxime Coste
df2bf9601c Regex: fix wrong fallthrough in dump_regex 2017-11-01 14:05:15 +08:00
Maxime Coste
d9b4076e3c Regex: Go back to instruction based search of next start
The previous method, which was a bit faster in the general use case,
can hit some cases where we get quadratic behaviour and very slow
matching.

By using an instruction, we can guarantee our complexity of O(N*M)
as we will never have more than N threads (N being the instruction
count) and we run the threads once per codepoint in the subject
string.

That slows down the general case slightly, but ensures we don't have
pathological cases.

This new version is much faster than the previous instruction based
search because it does not use a plain `.*` searcher, but a specific,
smarter instruction specialized for finding the next start if we are
in the correct conditions.
2017-11-01 14:05:15 +08:00
Maxime Coste
3f627058b0 Regex: add support for \0, \cX, \xXX and \uXXXX escapes 2017-11-01 14:05:15 +08:00
Maxime Coste
c423b47109 Regex: compute if codepoints outside of the start chars map can start 2017-11-01 14:05:15 +08:00
Maxime Coste
2c6c0be0c1 Regex: abort compilation as soon as we hit the instruction count limit 2017-11-01 14:05:15 +08:00
Maxime Coste
d44e160aa7 Regex: add a unit test for why lookaheads don't count for start chars anymore 2017-11-01 14:05:15 +08:00
Maxime Coste
87eec79d07 Regex: comment the mutables in CompiledRegex::Instruction and fix their init 2017-11-01 14:05:14 +08:00
Maxime Coste
8b2297f5ca Regex: Introduce a Regex memory domain to track usage separately 2017-11-01 14:05:14 +08:00
Maxime Coste
9ec175f2f8 Regex: use binary search for character class range checks 2017-11-01 14:05:14 +08:00
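One way such a check could look, assuming the ranges of a character class are kept sorted and non-overlapping. `Range` and `in_ranges` are hypothetical names, and this is a sketch rather than the repository's implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Inclusive codepoint range [min, max].
struct Range
{
    std::uint32_t min;
    std::uint32_t max;
};

// Requires sorted_ranges to be sorted by min and non-overlapping.
inline bool in_ranges(const std::vector<Range>& sorted_ranges, std::uint32_t cp)
{
    // Find the first range whose upper bound is not below cp.
    auto it = std::lower_bound(
        sorted_ranges.begin(), sorted_ranges.end(), cp,
        [](const Range& range, std::uint32_t value) { return range.max < value; });
    // cp is inside that range only if it also reaches the lower bound.
    return it != sorted_ranges.end() && it->min <= cp;
}
```

This takes a logarithmic number of comparisons in the number of ranges instead of scanning them all.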
Maxime Coste
6e65589a34 Regex: compute start chars from matchers, do not compute it from lookarounds
Computing potential start characters from lookarounds is more complex
than expected, and not worth the complexity.
2017-11-01 14:05:14 +08:00
Maxime Coste
df16fea82d Regex: rename "flags" to the more common "modifiers" 2017-11-01 14:05:14 +08:00
Maxime Coste
52d443f764 Regex: Correctly handle ignore case mode for start chars computation 2017-11-01 14:05:14 +08:00
Maxime Coste
b8495f0953 Regex: Rework parsing, treat lookarounds as assertions, and flags separately 2017-11-01 14:05:14 +08:00
Maxime Coste
b0233262b8 Regex: Limit programs to std::numeric_limits<uint16_t>::max() instructions 2017-11-01 14:05:14 +08:00
Maxime Coste
8c8dcb3a84 Regex: Fix reverse searching behaviour, again 2017-11-01 14:05:14 +08:00
Maxime Coste
9753bcd0ad Regex: limit explicit quantifiers value (to 1000 for now)
Fixes #1628
2017-11-01 14:05:14 +08:00
Maxime Coste
2b97e4e124 Regex: Fix handling of ^ and $ in backward matching mode 2017-11-01 14:05:14 +08:00