Maxime Coste
8b40f57145
Regex: Replace generic 'Matchers' with specialized functionality
...
Introduce CharacterClass and CharacterType Regex Op, and optimize
their evaluation.
2017-11-25 18:14:15 +08:00
Maxime Coste
5cfccad39c
Regex: Use MemoryDomain::Regex for captures and MatchResults contents
2017-11-12 12:30:21 +08:00
Maxime Coste
c9b43d3634
Regex: directly store instruction pointer in Thread struct
2017-11-11 15:15:13 +08:00
Maxime Coste
c74becc6af
Regex: fix RegexCompileFlags not being an enum class
2017-11-01 14:05:15 +08:00
Maxime Coste
2d901dc76f
Regex: slight readability improvement and work around a potential gcc bug
2017-11-01 14:05:15 +08:00
Maxime Coste
9e15207d2a
Regex: put the other char boolean inside the general start char map
2017-11-01 14:05:15 +08:00
Maxime Coste
e9e9a08e7b
Regex: refactor handling of Saves slightly, do not create them until really needed
2017-11-01 14:05:15 +08:00
Maxime Coste
d9b4076e3c
Regex: Go back to instruction based search of next start
...
The previous method, which was a bit faster in the general use case,
can hit some cases where we get quadratic behaviour and very slow
matching.
By using an instruction, we can guarantee a complexity of O(N*M),
as we will never have more than N threads (N being the instruction
count) and we run the threads once per codepoint in the subject
string (M being its length in codepoints).
That slows down the general case slightly, but ensures we don't have
pathological cases.
This new version is much faster than the previous instruction based
search because it does not use a plain `.*` searcher, but a specific,
smarter instruction specialized for finding the next start if we are
in the correct conditions.
2017-11-01 14:05:15 +08:00
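
The O(N*M) bound described in d9b4076e3c can be illustrated with a minimal thread-list matcher. This is only a sketch under stated assumptions: `Op`, `Inst`, `add_thread` and `search` are hypothetical names, the matcher works on bytes rather than codepoints, and re-adding a start thread at every position stands in for the specialized "find next start" instruction, which is not reproduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified instruction set (not Kakoune's CompiledRegex).
enum class Op { Char, AnyChar, Split, Jump, Match };

struct Inst
{
    Op op;
    char value;     // literal for Op::Char
    std::size_t a;  // Jump target, or first Split target
    std::size_t b;  // second Split target
};

// Add a thread at `pc`, following Jump/Split eagerly. `seen[pc] == step`
// means `pc` was already added for this step, so every instruction is added
// at most once per step and a thread list never exceeds prog.size() entries.
void add_thread(const std::vector<Inst>& prog, std::size_t pc,
                std::vector<std::size_t>& threads,
                std::vector<std::uint32_t>& seen, std::uint32_t step)
{
    assert(pc < prog.size());
    if (seen[pc] == step)
        return;
    seen[pc] = step;
    switch (prog[pc].op)
    {
        case Op::Jump:
            add_thread(prog, prog[pc].a, threads, seen, step);
            break;
        case Op::Split:
            add_thread(prog, prog[pc].a, threads, seen, step);
            add_thread(prog, prog[pc].b, threads, seen, step);
            break;
        default:
            threads.push_back(pc);
    }
}

// Returns true if `prog` matches anywhere in `subject`. `prog` must be
// non-empty and every path through it must end in Op::Match.
// `max_threads` reports the largest thread list that was run.
bool search(const std::vector<Inst>& prog, const std::string& subject,
            std::size_t& max_threads)
{
    std::vector<std::size_t> current, next;
    std::vector<std::uint32_t> seen(prog.size(), 0);
    std::uint32_t step = 1;
    max_threads = 0;

    add_thread(prog, 0, current, seen, step);
    for (std::size_t pos = 0; pos <= subject.size(); ++pos)
    {
        max_threads = std::max(max_threads, current.size());
        ++step; // marker for the list built during this iteration
        next.clear();
        for (std::size_t pc : current)
        {
            const Inst& inst = prog[pc];
            if (inst.op == Op::Match)
                return true;
            // Only Char and AnyChar remain: Jump/Split are never stored.
            const bool has_char = pos < subject.size();
            if (has_char && (inst.op == Op::AnyChar || subject[pos] == inst.value))
                add_thread(prog, pc + 1, next, seen, step);
        }
        // Lowest priority: also try starting a match at the next position.
        add_thread(prog, 0, next, seen, step);
        std::swap(current, next);
    }
    return false;
}
```

With a four-instruction program for `a+b` (`Char a`, `Split 0 2`, `Char b`, `Match`), `search` runs at most four threads per input byte however many candidate start positions the subject contains, which is where the N*M bound comes from.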
Maxime Coste
c423b47109
Regex: compute if codepoints outside of the start chars map can start
2017-11-01 14:05:15 +08:00
Maxime Coste
87eec79d07
Regex: comment the mutables in CompiledRegex::Instruction and fix their init
2017-11-01 14:05:14 +08:00
Maxime Coste
8b2297f5ca
Regex: Introduce a Regex memory domain to track usage separately
2017-11-01 14:05:14 +08:00
Maxime Coste
621b0d3ab8
Regex: remove the need for a processed inst vector
...
Identify each step with a counter, and check if the instruction
was already processed this step. This makes the matching faster,
by removing the need to maintain a vector of instructions executed
this step.
2017-11-01 14:05:14 +08:00
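
The step-counter idea from 621b0d3ab8 can be sketched as follows. The names `Inst`, `last_step` and `StepTracker` are hypothetical, and the 16-bit counter width and wrap-around handling are assumptions made for this illustration, not details taken from the commit.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical instruction: remembers the last step that processed it.
struct Inst
{
    std::uint16_t last_step = 0;
};

struct StepTracker
{
    std::uint16_t step = 0;

    // Start a new step. Must be called before the first mark().
    // No per-instruction work is needed unless the counter wraps, in which
    // case stale marks are cleared so they cannot alias the new values.
    void next_step(std::vector<Inst>& prog)
    {
        if (++step == 0)
        {
            for (Inst& inst : prog)
                inst.last_step = 0;
            step = 1;
        }
    }

    // Returns true the first time `inst` is seen during the current step.
    bool mark(Inst& inst) const
    {
        assert(step != 0); // next_step() has been called at least once
        if (inst.last_step == step)
            return false;
        inst.last_step = step;
        return true;
    }
};
```

Compared with keeping a vector of the instructions executed during a step, the check is a single integer comparison and nothing has to be cleared between steps in the common case.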
Maxime Coste
cfc52d7e6a
Regex: use intrusive linked list for the free saves instead of a Vector
2017-11-01 14:05:14 +08:00
Maxime Coste
b0233262b8
Regex: Limit programs to std::numeric_limits<uint16_t>::max() instructions
2017-11-01 14:05:14 +08:00
Maxime Coste
2b97e4e124
Regex: Fix handling of ^ and $ in backward matching mode
2017-11-01 14:05:14 +08:00
Maxime Coste
3c999aba37
Regex: Only reset processed and scheduled flags on relevant instructions
...
On big regexes, resetting all those flags on all instructions for each
character can become the dominant operation. Track the indices of the
instructions actually processed (the scheduled ones are already tracked
in the next_threads vector), and only reset these.
2017-11-01 14:05:14 +08:00
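
A minimal sketch of the approach in 3c999aba37: record which instruction indices were flagged during a step and reset only those. `Inst`, `ProcessedTracker`, `mark` and `end_step` are hypothetical names chosen for this illustration, and the scheduled flag is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical instruction carrying a per-step flag.
struct Inst
{
    bool processed = false;
};

class ProcessedTracker
{
public:
    explicit ProcessedTracker(std::vector<Inst>& prog) : m_prog(prog) {}

    // Flag the instruction and remember its index.
    // Returns false if it was already processed during this step.
    bool mark(std::uint16_t index)
    {
        assert(index < m_prog.size());
        Inst& inst = m_prog[index];
        if (inst.processed)
            return false;
        inst.processed = true;
        m_processed.push_back(index);
        return true;
    }

    // Reset only the flags that were set during this step and return how
    // many there were. The cost is proportional to the instructions that
    // were touched, not to the size of the program.
    std::size_t end_step()
    {
        const std::size_t count = m_processed.size();
        for (std::uint16_t index : m_processed)
            m_prog[index].processed = false;
        m_processed.clear();
        return count;
    }

private:
    std::vector<Inst>& m_prog;
    std::vector<std::uint16_t> m_processed;
};
```

For a program of 1000 instructions where a step only touches two of them, `end_step` clears two flags instead of 1000.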
Maxime Coste
5bf4be645a
Regex: Fix support for ignore case in lookarounds
2017-11-01 14:05:14 +08:00
Maxime Coste
dd9e43e6f9
Regex: small code cleanup
2017-11-01 14:05:14 +08:00
Maxime Coste
c8966ca701
Regex: Assert that the regex direction matches the vm direction
2017-11-01 14:05:14 +08:00
Maxime Coste
df73b71dfc
Regex: fix lookarounds handling when computing starting chars
2017-11-01 14:05:14 +08:00
Maxime Coste
065bbc8f59
Regex: switch to custom impl, use boost for checking
2017-11-01 14:05:14 +08:00
Maxime Coste
9305fa1369
Regex: Fix lookaround use in moon.kak
...
(?=[A-Z]\w*) is strictly the same as (?=[A-Z]) as \w* will always
at least match an empty string.
2017-11-01 14:05:14 +08:00
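
The equivalence stated in 9305fa1369 can be checked empirically. The sketch below uses `std::regex` (ECMAScript syntax) as a stand-in engine, not Kakoune's own implementation; `match_positions` is a hypothetical helper, and a trailing `\w+` is appended to both patterns so that matches consume input.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Collect the start offset of every match of `pattern` in `subject`.
std::vector<std::ptrdiff_t> match_positions(const std::string& pattern,
                                            const std::string& subject)
{
    const std::regex re{pattern};
    std::vector<std::ptrdiff_t> positions;
    for (std::sregex_iterator it{subject.begin(), subject.end(), re}, end;
         it != end; ++it)
        positions.push_back(it->position());
    return positions;
}
```

Calling `match_positions(R"((?=[A-Z]\w*)\w+)", s)` and `match_positions(R"((?=[A-Z])\w+)", s)` on the same subject `s` should return identical vectors, since `\w*` can always match the empty string and so adds no constraint to the lookahead.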
Maxime Coste
cca730193c
Regex: Support any char and character classes in lookarounds
...
Lookarounds still need to be fixed size, but accept character classes
as well as plain literals.
2017-11-01 14:05:14 +08:00
Maxime Coste
b8cb65160a
Regex: use std::conditional instead of custom template class to choose Utf8It
2017-11-01 14:05:14 +08:00
Maxime Coste
ea85f79384
Regex: add elided braces to fix compilation on older gcc
2017-11-01 14:05:14 +08:00
Maxime Coste
9ec376135b
Regex: Introduce RegexExecFlags::PrevAvailable
...
Rework assertion code as well.
2017-11-01 14:05:14 +08:00
Maxime Coste
73e177ec59
Regex: Do not use sized deallocation to support more compilers
2017-11-01 14:05:14 +08:00
Maxime Coste
30dacdade2
Regex: deallocate Saves memory on ThreadedRegexVM destruction
2017-11-01 14:05:14 +08:00
Maxime Coste
f3736a4b48
Regex: tag instructions as scheduled as well instead of searching
...
And a few more code cleanups in the ThreadedRegexVM
2017-11-01 14:05:14 +08:00
Maxime Coste
6bc5823745
Regex: refactor ThreadedRegexVM::exec_from code
2017-11-01 14:05:14 +08:00
Maxime Coste
4ff655cc09
Regex: store the processed flag directly in CompiledRegex instructions
2017-11-01 14:05:14 +08:00
Maxime Coste
732b8bc2a4
Regex: abandon bytecode and just use a simple list of instructions
...
Makes the code simpler.
2017-11-01 14:05:14 +08:00
Maxime Coste
6434bca325
Regex: Add some comments, remove spurious semicolons
2017-11-01 14:05:14 +08:00
Maxime Coste
911a893225
Regex: fix get_base(std::reverse_iterator<...>) returning a ref to temporary
2017-11-01 14:05:14 +08:00
Maxime Coste
11abd544c6
Regex: avoid infinite loops
2017-11-01 14:05:14 +08:00
Maxime Coste
c47cdc06a7
Regex: Add support for backward matching
...
Regex can be compiled for backward matching instead of forward matching
and the ThreadedRegexVM is able to iterate in reverse on the subject
string to find the last match instead of the first.
2017-11-01 14:05:14 +08:00
Maxime Coste
52ee62172a
Regex: remove use of buffer_utils.hh from regex_impl.cc
2017-11-01 14:05:14 +08:00
Maxime Coste
c375268c2d
Regex: Use memcpy to write/read offsets from bytecode
...
reinterpret_cast was undefined behaviour as we do not guarantee
that offsets are going to be stored properly aligned.
2017-11-01 14:05:14 +08:00
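
The alignment issue fixed in c375268c2d can be illustrated with a small sketch. `Offset`, `push_offset` and `read_offset` are hypothetical names, and the 32-bit offset width and byte-vector storage are assumptions; only the use of `memcpy` in place of `reinterpret_cast` comes from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using Offset = std::uint32_t; // assumed width, for illustration only

// Append `value` to the bytecode and return the position it was written at.
// memcpy works for any position, aligned or not.
std::size_t push_offset(std::vector<unsigned char>& bytecode, Offset value)
{
    const std::size_t pos = bytecode.size();
    bytecode.resize(pos + sizeof(Offset));
    std::memcpy(bytecode.data() + pos, &value, sizeof(Offset));
    return pos;
}

// Read an offset back from `pos`. Dereferencing
// reinterpret_cast<const Offset*>(bytecode.data() + pos) instead would be
// undefined behaviour whenever `pos` is not suitably aligned for Offset.
Offset read_offset(const std::vector<unsigned char>& bytecode, std::size_t pos)
{
    assert(pos + sizeof(Offset) <= bytecode.size());
    Offset value;
    std::memcpy(&value, bytecode.data() + pos, sizeof(Offset));
    return value;
}
```

Writing a one-byte opcode followed by an offset places the offset at position 1, which is misaligned for a 32-bit integer; the `memcpy` round trip is still well defined, and compilers typically lower it to a plain load or store on targets that allow unaligned access.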
Maxime Coste
236751cb84
Regex: Make ThreadedRegexVM a proper class, define a proper interface
2017-11-01 14:05:14 +08:00
Maxime Coste
3b69dda04e
Regex: Find potential start position using a map of valid start chars
...
With this optimization we get close to performance parity with boost
regex on the common use cases in Kakoune.
2017-11-01 14:05:14 +08:00
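
The start-character optimization in 3b69dda04e can be sketched as a lookup table that lets the matcher skip positions where no match can begin. `StartChars`, `make_start_chars` and `find_next_start` are hypothetical names, and the table is indexed by byte here for simplicity; how the real map is built from the compiled regex and how it treats non-ASCII codepoints is not shown.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// One flag per byte value: can a match start with this byte?
using StartChars = std::array<bool, 256>;

// Build the table from the set of bytes a match may start with.
StartChars make_start_chars(const std::string& possible_first_bytes)
{
    StartChars map{}; // value-initialized: all false
    for (unsigned char c : possible_first_bytes)
        map[c] = true;
    return map;
}

// Return the first position >= pos whose byte may start a match,
// or subject.size() if there is none.
std::size_t find_next_start(const std::string& subject, std::size_t pos,
                            const StartChars& start_chars)
{
    assert(pos <= subject.size());
    while (pos < subject.size() &&
           !start_chars[static_cast<unsigned char>(subject[pos])])
        ++pos;
    return pos;
}
```

For a regex such as `foo|bar`, the table would contain only `f` and `b`, so on the subject `xxx foo bar` the scan jumps straight from position 0 to position 4 without running any regex threads on the bytes in between.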
Maxime Coste
fabeab1ee1
Regex: reorder lookaround ops, group by direction
2017-11-01 14:05:14 +08:00
Maxime Coste
854144c535
Regex: Fix handling of Save instruction in ThreadedRegexVM
...
When not saving, we were not fully reading the instruction stream,
leading to an out of sync instruction pointer.
2017-11-01 14:05:14 +08:00
Maxime Coste
5f6e71c4dc
Regex: More code tweaks and cleanups in ThreadedRegexVM
2017-11-01 14:05:14 +08:00
Maxime Coste
5f54e0de0e
Regex: Code cleanup and refactor for Saves handling
2017-11-01 14:05:14 +08:00
Maxime Coste
dbb175841b
Regex: do not write the search prefix inside the program bytecode
...
It's faster to have specialized code in the VM directly
2017-11-01 14:05:14 +08:00
Maxime Coste
cf5055f68b
Regex: small code tweak
2017-11-01 14:05:14 +08:00
Maxime Coste
e0fac20f6c
Regex: Use a custom allocated buffer for Saves instead of a Vector
2017-11-01 14:05:14 +08:00
Maxime Coste
1399563e40
Regex: make m_current_threads and m_next_threads local variable of exec
2017-11-01 14:05:14 +08:00
Maxime Coste
54da8098ae
Regex: Add a NoSaves RegexExecFlags to disable saving positions
2017-11-01 14:05:14 +08:00
Maxime Coste
119bc38254
Regex: small refactor of ThreadedRegexVM::clone_saves
2017-11-01 14:05:14 +08:00