Commit Graph

84 Commits

Author SHA1 Message Date
Maxime Coste
4ac7df3842 Remove most regex impl special casing for backwards matching 2018-11-03 13:52:40 +11:00
Maxime Coste
ee74c2c2df Use custom code instead of reverse_iterator in Regex VM 2018-11-02 08:23:39 +11:00
Maxime Coste
6fce8050ee Use BufferCoord sentinel type for regex matching on BufferIterators
BufferIterators are large-ish, and need to check the buffer pointer
on comparison. Checking against a coord is just a 64 bit comparison.
2018-11-01 21:51:10 +11:00
Maxime Coste
4cd7583bbc Improve regex vm to next start performance by avoiding iterator copies 2018-11-01 08:22:43 +11:00
Maxime Coste
d652ec9ce1 Cleanup regex lookarounds implementation and reject incompatible regex
Fixes #2487
2018-10-10 22:47:59 +11:00
Maxime Coste
9024d41d64 Fix integer overflow leading to bad memory access in regex execution
Fixes #2481
Fixes #2480
2018-10-08 12:43:12 +11:00
Maxime Coste
7cf3cbde8e Cleanup some trailing whitespaces and double semicolon 2018-07-26 21:56:34 +10:00
Maxime Coste
0d6e04257b Fix memory leak in regex execution 2018-07-25 20:57:11 +10:00
Maxime Coste
7ed5d53fe6 Fix RegexCompileFlags::Backwards having the same value as Optimize
That means every Optimized regex had the Backwards version
compiled as well, which doubled the time it took to compile them
and doubled the memory usage of regex.

This should improve #2152
2018-07-19 18:34:40 +10:00
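The collision described in this commit can be sketched as follows. The enumerator names and bit values below are hypothetical and chosen only for illustration; they are not taken from Kakoune's actual RegexCompileFlags definition.

```cpp
#include <cassert>

// Hypothetical flag values, for illustration only.
enum class BuggyFlags : unsigned {
    None     = 0,
    NoSubs   = 1 << 0,
    Optimize = 1 << 1,
    Backward = 1 << 1, // bug: shares its bit with Optimize
};

enum class FixedFlags : unsigned {
    None     = 0,
    NoSubs   = 1 << 0,
    Optimize = 1 << 1,
    Backward = 1 << 2, // fix: each flag gets its own bit
};

// True if any bit of `flag` is set in `set`.
template<typename Flags>
constexpr bool has_flag(Flags set, Flags flag)
{
    return (static_cast<unsigned>(set) & static_cast<unsigned>(flag)) != 0;
}

// With the buggy values, asking for Optimize also appears to ask for Backward.
static_assert(has_flag(BuggyFlags::Optimize, BuggyFlags::Backward),
              "colliding bits make the two flags indistinguishable");
static_assert(!has_flag(FixedFlags::Optimize, FixedFlags::Backward),
              "distinct bits keep the two flags independent");
```

A compiler that checks `has_flag(flags, Backward)` before generating backward matching code would, with the colliding values, do that extra work for every optimized regex.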
Olivier Perret
67655de947 Use a dedicated vm op for dot when match-newline is false 2018-06-24 12:41:50 +02:00
Maxime Coste
787ca7f19b Regex: small code style tweak 2018-04-29 19:58:18 +10:00
Maxime Coste
1e8026f143 Regex: Use only 128 characters in start desc and encode others as 0
Using 257 consumed a lot of memory for no good reason, as codepoints
above 127 are not common enough to be treated specially.
2018-04-29 19:58:18 +10:00
Maxime Coste
528ecb7417 Regex: Use a custom 'DualThreadStack' structure to hold thread info
Instead of using two vectors, we can hold both current and next
threads in a single buffer, with stacks growing on each end.

Benchmarking shows this to be slightly faster, and should use less memory.
2018-04-29 19:58:18 +10:00
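A minimal sketch of the idea of two stacks sharing one buffer and growing from opposite ends. The class name `DualStack`, the `Thread` fields and every member function here are assumptions made for illustration; the real DualThreadStack may be organised quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical, simplified thread record.
struct Thread { std::uint16_t inst; std::int16_t saves; };

class DualStack
{
public:
    explicit DualStack(std::size_t capacity)
        : m_data(capacity), m_current(0), m_next(capacity) {}

    bool current_empty() const { return m_current == 0; }
    bool next_empty() const { return m_next == m_data.size(); }

    // The current stack grows upward from the front of the buffer.
    void push_current(Thread t) { grow_if_full(); m_data[m_current++] = t; }
    Thread pop_current() { assert(!current_empty()); return m_data[--m_current]; }

    // The next stack grows downward from the back of the buffer.
    void push_next(Thread t) { grow_if_full(); m_data[--m_next] = t; }

    // Once the current stack is drained, move the next threads into it.
    void swap_next()
    {
        assert(current_empty());
        while (m_next != m_data.size())
            m_data[m_current++] = m_data[m_next++];
    }

private:
    // The two stacks meet when the buffer is full; reallocate and
    // keep each stack attached to its own end.
    void grow_if_full()
    {
        if (m_current != m_next)
            return;
        const std::size_t old_size = m_data.size();
        const std::size_t new_size = old_size ? old_size * 2 : 4;
        const std::size_t next_count = old_size - m_next;
        std::vector<Thread> grown(new_size);
        for (std::size_t i = 0; i < m_current; ++i)
            grown[i] = m_data[i];
        for (std::size_t i = 0; i < next_count; ++i)
            grown[new_size - next_count + i] = m_data[m_next + i];
        m_data = std::move(grown);
        m_next = new_size - next_count;
    }

    std::vector<Thread> m_data;
    std::size_t m_current; // one past the top of the current stack
    std::size_t m_next;    // index of the top of the next stack
};
```

In this sketch a single allocation serves both thread lists, so free space left by one stack is available to the other.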
Maxime Coste
8438b33175 Add a debug regex command to dump regex instructions 2018-04-27 08:35:09 +10:00
Maxime Coste
f10eb9faa3 Use indices instead of pointers for saves/instruction in ThreadedRegexVM
Performance seems unaffected, but memory usage should be lowered
as the Thread struct is 4 bytes instead of 16.
2018-04-27 08:35:09 +10:00
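The size difference mentioned in this commit is consistent with replacing two 64-bit pointers by two 16-bit indices. The sketch below assumes that layout; the struct and field names are hypothetical and are not copied from ThreadedRegexVM.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical placeholder types for what a thread refers to.
struct Inst  { char op; };
struct Saves { const char* pos[2]; };

// Pointer-based record: two pointers, 16 bytes on a typical 64-bit target.
struct PointerThread
{
    const Inst* inst;
    Saves*      saves;
};

// Index-based record: two 16-bit indices into arrays owned by the VM.
struct IndexThread
{
    std::uint16_t inst;  // index into the instruction array
    std::int16_t  saves; // index into the saves array, -1 meaning "none" (assumption)
};

static_assert(sizeof(IndexThread) == 4, "two 16-bit indices pack into 4 bytes");
static_assert(sizeof(IndexThread) < sizeof(PointerThread),
              "the index form is smaller than the pointer form");
```

The trade-off in such a design is one extra indexing step on each access in exchange for a smaller thread record, and it only works while the program fits in the index type.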
Maxime Coste
fa17c46653 Regex: Refactor ThreadedRegexVM state handling
Remove ExecState to store threads inside the ThreadedRegexVM so that
memory buffers can be reused between executions. Extract an ExecConfig
struct with all the data that's execution specific to avoid storing
it needlessly inside the ThreadedRegexVM.
2018-04-25 21:19:04 +10:00
Maxime Coste
fb65fa60f8 Regex: take the full subject range as a parameter
To allow more general lookarounds outside of the actual search range,
pass a second range (the actual subject). This allows us to remove
various flags such as PrevAvailable or NotBeginOfSubject, which are
now easy to check from the subject range.

Fixes #1902
2018-03-05 05:48:10 +11:00
Maxime Coste
d9e44dfacf Regex: Remove helper functions from regex_impl.hh
They were close duplicates of the ones in regex.hh and not used
anywhere else.
2018-03-05 03:10:47 +11:00
Maxime Coste
933ac4d3d5 Regex: Improve comments and constify some variables
Reword various comments to make some tricky parts of the regex
engine easier to understand.
2018-02-24 17:40:08 +11:00
Maxime Coste
af21d4ca1e regex: track CompiledRegex::StartDesc in the Regex memory domain 2018-02-24 16:29:24 +11:00
Maxime Coste
6851604546 Regex: Add a RegexExecFlags::NotEndOfSubject flag 2017-12-29 09:55:38 +11:00
Maxime Coste
413f880e9e Regex: Support forward and backward matching code in the same CompiledRegex
No need to have two separate regexes to handle forward and backward
matching, just passing RegexCompileFlags::Backward will add support
for backward matching to the regex. For backward only regex, pass
RegexCompileFlags::NoForward as well to disable generation of
forward matching code.
2017-12-01 19:57:02 +08:00
Maxime Coste
8d892eeb62 Regex: use StartDesc to early out when not searching
Early out as well if we do not find any potential start position.
2017-12-01 15:03:03 +08:00
Maxime Coste
65b057f261 Regex: rename StartChars to StartDesc
It only contains chars for now, but it's still more generally
describing where matches can start.
2017-12-01 14:46:18 +08:00
Maxime Coste
a52da6fe34 Regex: Tweak is_ctype implementation style 2017-11-28 00:13:42 +08:00
Maxime Coste
8b40f57145 Regex: Replace generic 'Matchers' with specialized functionality
Introduce CharacterClass and CharacterType Regex Op, and optimize
their evaluation.
2017-11-25 18:14:15 +08:00
Maxime Coste
5cfccad39c Regex: Use MemoryDomain::Regex for captures and MatchResults contents 2017-11-12 12:30:21 +08:00
Maxime Coste
c9b43d3634 Regex: directly store instruction pointer in Thread struct 2017-11-11 15:15:13 +08:00
Maxime Coste
c74becc6af Regex: fix RegexCompileFlags not being an enum class 2017-11-01 14:05:15 +08:00
Maxime Coste
2d901dc76f Regex: slight readability improvement and workaround a potential gcc bug 2017-11-01 14:05:15 +08:00
Maxime Coste
9e15207d2a Regex: put the other char boolean inside the general start char map 2017-11-01 14:05:15 +08:00
Maxime Coste
e9e9a08e7b Regex: refactor handling of Saves slightly, do not create them until really needed 2017-11-01 14:05:15 +08:00
Maxime Coste
d9b4076e3c Regex: Go back to instruction based search of next start
The previous method, which was a bit faster in the general use case,
can hit some cases where we get quadratic behaviour and very slow
matching.

By using an instruction, we can guarantee our complexity of O(N*M)
as we will never have more than N threads (N being the instruction
count) and we run the threads once per codepoint in the subject
string.

That slows down the general case slightly, but ensures we don't have
pathological cases.

This new version is much faster than the previous instruction based
search because it does not use a plain `.*` searcher, but a specific,
smarter instruction specialized for finding the next start if we are
in the correct conditions.
2017-11-01 14:05:15 +08:00
Maxime Coste
c423b47109 Regex: compute if codepoints outside of the start chars map can start 2017-11-01 14:05:15 +08:00
Maxime Coste
87eec79d07 Regex: comment the mutables in CompiledRegex::Instruction and fix their init 2017-11-01 14:05:14 +08:00
Maxime Coste
8b2297f5ca Regex: Introduce a Regex memory domain to track usage separately 2017-11-01 14:05:14 +08:00
Maxime Coste
621b0d3ab8 Regex: remove the need for a processed inst vector
Identify each step with a counter, and check if the instruction
was already processed this step. This makes the matching faster,
by removing the need to maintain a vector of instructions executed
this step.
2017-11-01 14:05:14 +08:00
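The step-counter technique described in this commit can be sketched as follows. `StepTracker`, `last_step` and the member functions are hypothetical names used for illustration, not the identifiers from the actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical instruction record: only the bookkeeping field is shown.
struct Instruction
{
    std::uint32_t last_step = 0; // step in which this instruction was last processed
};

class StepTracker
{
public:
    explicit StepTracker(std::size_t instruction_count)
        : m_insts(instruction_count) {}

    // Call once per input codepoint, before processing any thread.
    void begin_step()
    {
        if (++m_step == 0) // counter wrapped: clear stale stamps and restart at 1
        {
            for (auto& inst : m_insts)
                inst.last_step = 0;
            m_step = 1;
        }
    }

    // Returns true the first time an instruction is seen in the current step,
    // false if it was already processed during this step.
    bool mark_processed(std::size_t index)
    {
        Instruction& inst = m_insts[index];
        if (inst.last_step == m_step)
            return false;
        inst.last_step = m_step;
        return true;
    }

private:
    std::vector<Instruction> m_insts;
    std::uint32_t m_step = 0;
};
```

Starting a new step is a single increment, so nothing has to be cleared or rebuilt between codepoints except on the rare counter wrap.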
Maxime Coste
cfc52d7e6a Regex: use intrusive linked list for the free saves instead of a Vector 2017-11-01 14:05:14 +08:00
Maxime Coste
b0233262b8 Regex: Limit programs to std::numeric_limits<uint16_t>::max() instructions 2017-11-01 14:05:14 +08:00
Maxime Coste
2b97e4e124 Regex: Fix handling of ^ and $ in backward matching mode 2017-11-01 14:05:14 +08:00
Maxime Coste
3c999aba37 Regex: Only reset processed and scheduled flags on relevant instructions
On big regexes, resetting all those flags on all instructions for each
character can become the dominant operation. Track the actual
instruction indices processed (the scheduled ones are already tracked in
the next_threads vector), and only reset these.
2017-11-01 14:05:14 +08:00
Maxime Coste
5bf4be645a Regex: Fix support for ignore case in lookarounds 2017-11-01 14:05:14 +08:00
Maxime Coste
dd9e43e6f9 Regex: small code cleanup 2017-11-01 14:05:14 +08:00
Maxime Coste
c8966ca701 Regex: Assert that the regex direction matches the vm direction 2017-11-01 14:05:14 +08:00
Maxime Coste
df73b71dfc Regex: fix lookarounds handling when computing starting chars 2017-11-01 14:05:14 +08:00
Maxime Coste
065bbc8f59 Regex: switch to custom impl, use boost for checking 2017-11-01 14:05:14 +08:00
Maxime Coste
9305fa1369 Regex: Fix lookaround use in moon.kak
(?=[A-Z]\w*) is strictly the same as (?=[A-Z]) as \w* will always
at least match an empty string.
2017-11-01 14:05:14 +08:00
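The equivalence claimed in this commit can be checked with any engine that supports lookaheads. The sketch below uses std::regex (ECMAScript syntax) as a stand-in rather than Kakoune's own engine, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Position of the first match of `re` in `subject`, or -1 if there is none.
long first_match_pos(const std::regex& re, const std::string& subject)
{
    std::smatch match;
    if (!std::regex_search(subject, match, re))
        return -1;
    return static_cast<long>(match.position(0));
}

// Both lookaheads succeed exactly where an uppercase ASCII letter follows,
// because \w* can always match the empty string after [A-Z].
bool lookaheads_agree(const std::string& subject)
{
    static const std::regex with_tail(R"((?=[A-Z]\w*))");
    static const std::regex without_tail(R"((?=[A-Z]))");
    return first_match_pos(with_tail, subject)
        == first_match_pos(without_tail, subject);
}
```

Since a lookahead consumes no input, dropping the `\w*` tail changes neither where the pattern matches nor what it captures.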
Maxime Coste
cca730193c Regex: Support any char and character classes in lookarounds
Lookarounds still need to be fixed size, but accept character classes
as well as plain literals.
2017-11-01 14:05:14 +08:00
Maxime Coste
b8cb65160a Regex: use std::conditional instead of custom template class to choose Utf8It 2017-11-01 14:05:14 +08:00
Maxime Coste
ea85f79384 Regex: add elided braces to fix compilation on older gcc 2017-11-01 14:05:14 +08:00