
101 Commits

Author SHA1 Message Date
Maxime Coste
da80a8cf6a Raise ThreadedVM initial thread capacity to 16
Threads are 4 bytes each, so an initial capacity of 4 allocated only
16 bytes; raising that to 64 bytes seems quite reasonable.
2021-03-03 20:51:24 +11:00
Maxime Coste
d539e8fb89 Do not decode utf-8 when looking for regex next start
There is no need to decode as we know any non-ascii characters will
be treated as Other in the StartDesc.
2019-12-04 22:33:11 +11:00
Jason Felice
d26bb0ce2b Add static or const where useful 2019-11-09 12:53:45 -05:00
Maxime Coste
d9d2140ea2 Fix regex not always selecting the leftmost longest match
(Actually the rightmost longest match when searching backwards)

Fixes #2710
2019-02-04 17:33:29 +11:00
Maxime Coste
77b1216ace Add a peephole optimization pass to the regex compiler 2019-01-20 22:59:28 +11:00
Maxime Coste
0364a99827 Refactor regex find next start not to be an instruction anymore
The same logic can be hard coded, avoiding one thread and 3
instructions, improving the regex matching speed.
2019-01-20 22:59:28 +11:00
Maxime Coste
fd043435e5 Split compile time regex flags from runtime ones 2019-01-20 22:59:28 +11:00
Maxime Coste
328c497be2 Add support for named captures to the regex impl and regex highlighter
ECMAScript is adding support for it, and it is a pretty isolated
change to do.

Fixes #2293
2019-01-03 22:55:50 +11:00
Maxime Coste
ef3419edbf Do not pass thread to failed/consumed, capture it implicitly 2018-12-19 19:16:14 +11:00
Maxime Coste
0b9f782691 Take iterators by const-ref in ThreadedRegexVM::exec 2018-12-19 19:14:42 +11:00
Maxime Coste
021ba55b38 Small code tweak in DualThreadStack::swap_next 2018-11-14 17:50:17 +11:00
Maxime Coste
8c2c3d27ad Fix memory leak in DualThreadStack
Fixes #2556
2018-11-07 12:28:41 +11:00
Maxime Coste
7f83c41256 align ThreadedRegexVM::Thread to permit fused copy optimization
Aligning makes gcc able to copy a Thread object with a single
32-bit mov instruction instead of two 16-bit ones.
2018-11-06 20:13:09 +11:00
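The alignment change described in the commit above can be illustrated with a minimal sketch. The field names and types below are assumptions chosen to give a 4-byte record made of two 16-bit halves; they are not taken from the actual source, and whether a compiler really emits a single 32-bit move depends on the compiler and target.

```cpp
#include <cstdint>

// Hypothetical thread record made of two 16-bit fields (names are illustrative).
// Without alignas, its natural alignment is that of a 16-bit integer.
struct PackedThread
{
    std::uint16_t inst;
    std::int16_t saves;
};

// Same layout, but aligned to 4 bytes so that the whole object sits on a
// 32-bit boundary and may be copied as one 32-bit word.
struct alignas(4) Thread
{
    std::uint16_t inst;
    std::int16_t saves;
};

static_assert(sizeof(Thread) == 4, "Thread should stay 4 bytes wide");
static_assert(alignof(Thread) == 4, "Thread should be 32-bit aligned");
static_assert(alignof(PackedThread) == alignof(std::uint16_t),
              "without alignas, alignment follows the widest member");

// A plain copy; with the stricter alignment the compiler is free to fuse
// the two 16-bit member copies into one 32-bit load/store.
inline Thread copy_thread(const Thread& t) { return t; }
```

The size of the record does not change; only the alignment guarantee does, which is what gives the compiler room to fuse the copies.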
Maxime Coste
05a9eb62f4 Never grow the DualThreadStack in push_next
As we do at most one push_next per step_thread, and we pop_current
before step_thread, we can avoid a branch there at the expense of
sometimes growing unnecessarily (once).
2018-11-06 07:32:47 +11:00
Maxime Coste
7fbde0d44e Various micro performance tweaks in ThreadedRegexVM 2018-11-05 21:54:29 +11:00
Maxime Coste
7959c7f731 Refactor ThreadedRegexVM::exec_program to avoid branching
Moving logic into step_thread instead of returning an enum to
select what to run avoids the switch logic and improves run time.
2018-11-05 19:46:53 +11:00
Maxime Coste
7463a0d449 Remove use of utf8::iterator in regex execution
This avoids having two copies of the subject string bounds, one
in the ExecConfig and one in the utf8 iterator.
2018-11-05 08:17:50 +11:00
Maxime Coste
4ac7df3842 Remove most regex impl special casing for backwards matching 2018-11-03 13:52:40 +11:00
Maxime Coste
ee74c2c2df Use custom code instead of reverse_iterator in Regex VM 2018-11-02 08:23:39 +11:00
Maxime Coste
6fce8050ee Use BufferCoord sentinel type for regex matching on BufferIterators
BufferIterators are large-ish, and need to check the buffer pointer
on comparison. Checking against a coord is just a 64 bit comparison.
2018-11-01 21:51:10 +11:00
Maxime Coste
4cd7583bbc Improve regex vm to next start performance by avoiding iterator copies 2018-11-01 08:22:43 +11:00
Maxime Coste
d652ec9ce1 Cleanup regex lookarounds implementation and reject incompatible regex
Fixes #2487
2018-10-10 22:47:59 +11:00
Maxime Coste
9024d41d64 Fix integer overflow leading to bad memory access in regex execution
Fixes #2481
Fixes #2480
2018-10-08 12:43:12 +11:00
Maxime Coste
7cf3cbde8e Cleanup some trailing whitespaces and double semicolon 2018-07-26 21:56:34 +10:00
Maxime Coste
0d6e04257b Fix memory leak in regex execution 2018-07-25 20:57:11 +10:00
Maxime Coste
7ed5d53fe6 Fix RegexCompileFlags::Backwards having the same value as Optimize
That means every Optimized regex had the Backwards version
compiled as well, which doubled the time it took to compile them
and doubled the memory usage of regex.

This should improve #2152
2018-07-19 18:34:40 +10:00
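The flag collision fixed in the commit above is easy to guard against at compile time. The following is a minimal sketch, not the project's code: the enum name `CompileFlags`, its bit values, and the helpers `has_flag` and `distinct_single_bits` are all hypothetical.

```cpp
#include <cstdint>
#include <initializer_list>

// Hypothetical flag set; the names echo the commit messages, the values are assumed.
enum class CompileFlags : std::uint32_t
{
    None      = 0,
    Optimize  = 1u << 0,
    Backward  = 1u << 1,
    NoForward = 1u << 2,
};

constexpr CompileFlags operator|(CompileFlags a, CompileFlags b)
{
    return static_cast<CompileFlags>(static_cast<std::uint32_t>(a) |
                                     static_cast<std::uint32_t>(b));
}

constexpr bool has_flag(CompileFlags set, CompileFlags flag)
{
    return (static_cast<std::uint32_t>(set) & static_cast<std::uint32_t>(flag)) != 0;
}

// Returns true only if every flag is exactly one bit and no bit is shared.
constexpr bool distinct_single_bits(std::initializer_list<CompileFlags> flags)
{
    std::uint32_t seen = 0;
    for (CompileFlags f : flags)
    {
        const std::uint32_t bits = static_cast<std::uint32_t>(f);
        if (bits == 0 || (bits & (bits - 1)) != 0 || (seen & bits) != 0)
            return false;
        seen |= bits;
    }
    return true;
}

// If two flags ever shared a value, this would fail to compile instead of
// silently enabling one feature whenever the other is requested.
static_assert(distinct_single_bits({CompileFlags::Optimize,
                                    CompileFlags::Backward,
                                    CompileFlags::NoForward}),
              "compile flags must not overlap");
```

With such a check in place, testing `has_flag(flags, CompileFlags::Backward)` on a set built from `CompileFlags::Optimize` alone yields false, which is the behaviour the fix restores.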
Olivier Perret
67655de947 Use a dedicated vm op for dot when match-newline is false 2018-06-24 12:41:50 +02:00
Maxime Coste
787ca7f19b Regex: small code style tweak 2018-04-29 19:58:18 +10:00
Maxime Coste
1e8026f143 Regex: Use only 128 characters in start desc and encode others as 0
Using 257 used a lot of memory for no good reason, as codepoints
above 127 are not common enough to be treated specially.
2018-04-29 19:58:18 +10:00
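A 128-entry start map with all other codepoints folded into slot 0, as the commit above describes, could look roughly like the sketch below. The names `StartMap`, `start_map_size`, `other_slot`, and `find_next_start` are assumptions made for illustration. One side effect of this encoding is that codepoint 0 shares its slot with every codepoint above 127.

```cpp
#include <cstdint>

using Codepoint = std::uint32_t;

constexpr Codepoint start_map_size = 128; // one slot per ASCII codepoint
constexpr Codepoint other_slot = 0;       // codepoints >= 128 share this slot

// Hypothetical description of which codepoints may begin a match.
struct StartMap
{
    bool map[start_map_size] = {};

    void add(Codepoint cp) { map[cp < start_map_size ? cp : other_slot] = true; }

    bool can_start_with(Codepoint cp) const
    {
        return map[cp < start_map_size ? cp : other_slot];
    }
};

// Scans raw UTF-8 bytes without decoding them. Any byte >= 0x80 belongs to a
// codepoint >= 128, which the map already treats as "other", so looking up
// the byte gives the same answer as looking up the decoded codepoint.
// Assumes `begin` points at a codepoint boundary.
inline const char* find_next_start(const char* begin, const char* end,
                                   const StartMap& desc)
{
    for (const char* it = begin; it != end; ++it)
    {
        const unsigned char byte = static_cast<unsigned char>(*it);
        if (desc.can_start_with(byte))
            return it;
    }
    return end;
}
```

The lookup costs one comparison and one array read per byte, and the whole map fits in 128 booleans.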
Maxime Coste
528ecb7417 Regex: Use a custom 'DualThreadStack' structure to hold thread info
Instead of using two vectors, we can hold both current and next
threads in a single buffer, with stacks growing on each end.

Benchmarking shows this to be slightly faster, and should use less memory.
2018-04-29 19:58:18 +10:00
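The single-buffer layout described in the commit above, with the current and next stacks growing from opposite ends, can be sketched as follows. This is an illustrative reimplementation under assumptions: the class name `DualStack`, the `Thread` fields, and the growth policy (doubling) are hypothetical and not taken from the real DualThreadStack.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical 4-byte thread record; field names are illustrative.
struct Thread
{
    std::uint16_t inst;
    std::int16_t saves;
};

// Two stacks in one buffer: "current" grows up from index 0,
// "next" grows down from the end. The buffer is full when they meet.
class DualStack
{
public:
    explicit DualStack(std::size_t capacity = 16)
        : m_data(capacity > 0 ? capacity : 1), m_next(m_data.size()) {}

    bool current_empty() const { return m_current == 0; }
    bool next_empty() const { return m_next == m_data.size(); }

    void push_current(Thread t) { grow_if_full(); m_data[m_current++] = t; }
    void push_next(Thread t) { grow_if_full(); m_data[--m_next] = t; }

    Thread pop_current()
    {
        assert(!current_empty());
        return m_data[--m_current];
    }

    // Turns the next stack into the current one. Copying from the top of
    // "next" upwards means the thread pushed first is popped first.
    void swap_next()
    {
        assert(current_empty());
        for (; m_next != m_data.size(); ++m_next)
            m_data[m_current++] = m_data[m_next];
    }

private:
    void grow_if_full()
    {
        if (m_current != m_next)
            return;
        const std::size_t next_count = m_data.size() - m_next;
        std::vector<Thread> bigger(m_data.size() * 2);
        // Keep "current" at the bottom and "next" at the top of the new buffer.
        std::copy(m_data.begin(), m_data.begin() + m_current, bigger.begin());
        std::copy(m_data.begin() + m_next, m_data.end(), bigger.end() - next_count);
        m_next = bigger.size() - next_count;
        m_data = std::move(bigger);
    }

    std::vector<Thread> m_data;
    std::size_t m_current = 0; // one past the top of the current stack
    std::size_t m_next;        // index of the top of the next stack
};
```

A matching loop would pop threads from the current stack, push their successors onto the next stack, and call `swap_next()` once per input position. Both stacks share one allocation, so no second vector has to be kept around.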
Maxime Coste
8438b33175 Add a debug regex command to dump regex instructions 2018-04-27 08:35:09 +10:00
Maxime Coste
f10eb9faa3 Use indices instead of pointers for saves/instruction in ThreadedRegexVM
Performance seems unaffected, but memory usage should be lowered
as the Thread struct is 4 bytes instead of 16.
2018-04-27 08:35:09 +10:00
Maxime Coste
fa17c46653 Regex: Refactor ThreadedRegexVM state handling
Remove ExecState to store threads inside the ThreadedRegexVM so that
memory buffers can be reused between executions. Extract an ExecConfig
struct with all the data that is execution specific to avoid storing
it needlessly inside the ThreadedRegexVM.
2018-04-25 21:19:04 +10:00
Maxime Coste
fb65fa60f8 Regex: take the full subject range as a parameter
To allow more general lookarounds outside of the actual search range,
pass a second range (the actual subject). This allows us to remove
various flags such as PrevAvailable or NotBeginOfSubject, which are
now easy to check from the subject range.

Fixes #1902
2018-03-05 05:48:10 +11:00
Maxime Coste
d9e44dfacf Regex: Remove helper functions from regex_impl.hh
They were close duplicates from the ones in regex.hh and not used
anywhere else.
2018-03-05 03:10:47 +11:00
Maxime Coste
933ac4d3d5 Regex: Improve comments and constify some variables
Reword various comments to make some tricky parts of the regex
engine easier to understand.
2018-02-24 17:40:08 +11:00
Maxime Coste
af21d4ca1e regex: track CompiledRegex::StartDesc in the Regex memory domain 2018-02-24 16:29:24 +11:00
Maxime Coste
6851604546 Regex: Add a RegexExecFlags::NotEndOfSubject flag 2017-12-29 09:55:38 +11:00
Maxime Coste
413f880e9e Regex: Support forward and backward matching code in the same CompiledRegex
No need to have two separate regexes to handle forward and backward
matching, just passing RegexCompileFlags::Backward will add support
for backward matching to the regex. For backward only regex, pass
RegexCompileFlags::NoForward as well to disable generation of
forward matching code.
2017-12-01 19:57:02 +08:00
Maxime Coste
8d892eeb62 Regex: use StartDesc to early out when not searching
Early out as well if we do not find any potential start position.
2017-12-01 15:03:03 +08:00
Maxime Coste
65b057f261 Regex: rename StartChars to StartDesc
It only contains chars for now, but it's still more generally
describing where matches can start.
2017-12-01 14:46:18 +08:00
Maxime Coste
a52da6fe34 Regex: Tweak is_ctype implementation style 2017-11-28 00:13:42 +08:00
Maxime Coste
8b40f57145 Regex: Replace generic 'Matchers' with specialized functionality
Introduce CharacterClass and CharacterType Regex Op, and optimize
their evaluation.
2017-11-25 18:14:15 +08:00
Maxime Coste
5cfccad39c Regex: Use MemoryDomain::Regex for captures and MatchResults contents 2017-11-12 12:30:21 +08:00
Maxime Coste
c9b43d3634 Regex: directly store instruction pointer in Thread struct 2017-11-11 15:15:13 +08:00
Maxime Coste
c74becc6af Regex: fix RegexCompileFlags not being an enum class 2017-11-01 14:05:15 +08:00
Maxime Coste
2d901dc76f Regex: slight readability improvement and workaround a potential gcc bug 2017-11-01 14:05:15 +08:00
Maxime Coste
9e15207d2a Regex: put the other char boolean inside the general start char map 2017-11-01 14:05:15 +08:00
Maxime Coste
e9e9a08e7b Regex: refactor handling of Saves slightly, do not create them until really needed 2017-11-01 14:05:15 +08:00
Maxime Coste
d9b4076e3c Regex: Go back to instruction based search of next start
The previous method, which was a bit faster in the general use case,
can hit some cases where we get quadratic behaviour and very slow
matching.

By using an instruction, we can guarantee our complexity of O(N*M)
as we will never have more than N threads (N being the instruction
count) and we run the threads once per codepoint in the subject
string.

That slows down the general case slightly, but ensures we don't have
pathological cases.

This new version is much faster than the previous instruction based
search because it does not use a plain `.*` searcher, but a specific,
smarter instruction specialized for finding the next start if we are
in the correct conditions.
2017-11-01 14:05:15 +08:00