Commit Graph

149 Commits

Author SHA1 Message Date
Maxime Coste
8566ae14a0 Reduce the amount of Regex VM Instruction code
Merge all lookarounds into the same instruction, merge splits, merge
literal ignore case with literal...

Besides reducing the amount of almost duplicated code, this improves
performance by reducing pressure on the (often failing) branch target
prediction for instruction dispatching by moving branches into the
instruction code itself, where they are more likely to be well
predicted.
2021-11-21 09:44:18 +11:00
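The merge described in this commit can be sketched as follows. This is a minimal illustration, not Kakoune's actual code: `Op`, `Inst`, and `step` are hypothetical names. A single `Literal` opcode carries an `ignore_case` flag, so the dispatch `switch` has fewer targets and the case test becomes an ordinary branch inside the handler.

```cpp
#include <cctype>
#include <cstdint>

// Hypothetical, simplified instruction set (assumption, not Kakoune's).
enum class Op : std::uint8_t { Literal, AnyChar };

struct Inst
{
    Op op;
    bool ignore_case; // replaces a separate "LiteralIgnoreCase" opcode
    char value;
};

inline char lower(char c)
{
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Execute one instruction against one input character.
inline bool step(const Inst& inst, char c)
{
    switch (inst.op) // dispatch: one target per opcode
    {
        case Op::Literal:
            // The case-sensitivity test is a branch inside the handler,
            // not an extra dispatch target.
            return inst.ignore_case ? lower(c) == lower(inst.value)
                                    : c == inst.value;
        case Op::AnyChar:
            return true;
    }
    return false;
}
```

Whether this actually helps branch prediction depends on the workload and the CPU; the commit message reports the effect for Kakoune's VM only.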
Peter Pentchev
aa88f459ff Use the [[gnu::packed]] C++ attribute.
Suggested by: Maxime Coste <mawww@kakoune.org>
2021-08-21 17:06:14 +03:00
Peter Pentchev
6e686af8b5 Do not break non-GCC/g++ compilers. 2021-08-20 17:21:26 +03:00
Peter Pentchev
0e9624f69f Make sure the ParsedRegex structure has the right size.
Some versions of GCC/g++ will not necessarily pad the structure to
a 32-bit boundary, so make the alignment and the filler explicit.

Detected on: Debian/m68k; https://buildd.debian.org/status/fetch.php?pkg=kakoune&arch=m68k&ver=2020.09.01-1&stamp=1629387444&raw=0
2021-08-20 17:13:34 +03:00
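A minimal sketch of the approach this commit describes, assuming a hypothetical `ExampleNode` layout (the real ParsedRegex node fields are not shown in this log): the alignment and the filler bytes are spelled out, and a `static_assert` catches any platform where the size still differs.

```cpp
#include <cstdint>

// Hypothetical layout for illustration only.
// Without alignas and the filler, a target with 2-byte alignment for
// 32-bit integers could give this struct a size of 6 instead of 8.
struct alignas(4) ExampleNode
{
    std::uint32_t value;      // 4 bytes
    std::uint8_t  op;         // 1 byte
    std::uint8_t  flags;      // 1 byte
    std::uint16_t filler = 0; // explicit padding up to the 32-bit boundary
};

static_assert(sizeof(ExampleNode) == 8, "ExampleNode must be exactly 8 bytes");
static_assert(alignof(ExampleNode) == 4, "ExampleNode must be 32-bit aligned");
```

Making the padding a named member keeps the layout identical across compilers instead of relying on each ABI's implicit padding rules.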
Maxime Coste
b57dc7c512 Code style tweak for Regex implementation TestVM 2021-07-31 08:55:52 +10:00
Maxime Coste
a0c23ccb72 Add missing limits includes
Fixes #4003
2021-01-03 10:58:09 +11:00
Maxime Coste
e9cf0f23f2 Fix regex start desc computation for case insensitive ranges
Fixes #3345
2020-02-07 07:37:29 +11:00
Maxime Coste
eb5af59d55 Restore regex optimization pass by introducing basic block analysis
Run the peephole optimizer on each basic block, avoiding the
previous issue that some instructions could move across their
boundaries.
2019-12-05 21:10:14 +11:00
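The basic block split mentioned here can be sketched with the textbook "leader" rule. This is an assumption about the general technique, not a copy of the commit: `ExampleInst` and `block_starts` are hypothetical names, and the instruction set is reduced to four kinds.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified instruction (assumption, not Kakoune's).
struct ExampleInst
{
    enum Kind { Char, Jump, Split, Match } kind;
    std::size_t target; // only meaningful for Jump and Split
};

// Return the index of the first instruction of every basic block.
// A block starts at instruction 0, at every jump/split target, and
// right after every jump/split.
inline std::vector<std::size_t> block_starts(const std::vector<ExampleInst>& prog)
{
    std::vector<bool> leader(prog.size(), false);
    if (!prog.empty())
        leader[0] = true;

    for (std::size_t i = 0; i < prog.size(); ++i)
    {
        const ExampleInst& inst = prog[i];
        if (inst.kind != ExampleInst::Jump && inst.kind != ExampleInst::Split)
            continue;
        if (inst.target < prog.size())
            leader[inst.target] = true;
        if (i + 1 < prog.size())
            leader[i + 1] = true;
    }

    std::vector<std::size_t> starts;
    for (std::size_t i = 0; i < leader.size(); ++i)
        if (leader[i])
            starts.push_back(i);
    return starts;
}
```

A peephole pass that only rewrites instructions between two consecutive starts cannot move an instruction across a block boundary, which is the failure the commit message refers to.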
Jason Felice
d26bb0ce2b Add static or const where useful 2019-11-09 12:53:45 -05:00
Maxime Coste
3e7301ede7 Support \x and \u escapes in regex character classes
Change \u to use 6 digits to cover the full unicode range.

Fixes #3172
2019-11-06 20:48:48 +11:00
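Six hexadecimal digits are enough because the highest Unicode code point is U+10FFFF. A minimal sketch of parsing a fixed number of hex digits follows; `parse_hex_digits` is a hypothetical helper, not the function used in Kakoune, and the digit count is left as a parameter.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Parse exactly `count` hexadecimal digits from the start of `text`.
// Returns std::nullopt if there are too few characters or a non-hex one.
inline std::optional<std::uint32_t> parse_hex_digits(std::string_view text,
                                                     std::size_t count)
{
    if (text.size() < count)
        return std::nullopt;

    std::uint32_t codepoint = 0;
    for (std::size_t i = 0; i < count; ++i)
    {
        const char c = text[i];
        std::uint32_t digit = 0;
        if (c >= '0' && c <= '9')
            digit = static_cast<std::uint32_t>(c - '0');
        else if (c >= 'a' && c <= 'f')
            digit = static_cast<std::uint32_t>(c - 'a') + 10;
        else if (c >= 'A' && c <= 'F')
            digit = static_cast<std::uint32_t>(c - 'A') + 10;
        else
            return std::nullopt;
        codepoint = codepoint * 16 + digit;
    }
    return codepoint;
}
```

With this sketch, the text after a `\u` escape would be read with `parse_hex_digits(rest, 6)`.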
Tobias Kortkamp
16bb55edee Fix build on FreeBSD
file.cc:390:21: error: use of undeclared identifier 'rename'; did you mean 'devname'?
    if (replace and rename(temp_filename, zfilename) != 0)
                    ^~~~~~
                    devname
/usr/include/stdlib.h:277:7: note: 'devname' declared here
char    *devname(__dev_t, __mode_t);
         ^
file.cc:390:28: error: cannot initialize a parameter of type '__dev_t' (aka 'unsigned long') with an lvalue of type 'char [1024]'
    if (replace and rename(temp_filename, zfilename) != 0)
                           ^~~~~~~~~~~~~
/usr/include/stdlib.h:277:22: note: passing argument to parameter here
char    *devname(__dev_t, __mode_t);
                        ^
2 errors generated.

---

highlighters.cc:1110:13: error: use of undeclared identifier 'snprintf'; did you mean 'vswprintf'?
            snprintf(buffer, 16, format, std::abs(line_to_format));
            ^~~~~~~~
            vswprintf
/usr/include/wchar.h:139:5: note: 'vswprintf' declared here
int     vswprintf(wchar_t * __restrict, size_t n, const wchar_t * __restrict,
        ^
highlighters.cc:1110:22: error: cannot initialize a parameter of type 'wchar_t *' with an lvalue of type 'char [16]'
            snprintf(buffer, 16, format, std::abs(line_to_format));
                     ^~~~~~
/usr/include/wchar.h:139:35: note: passing argument to parameter here
int     vswprintf(wchar_t * __restrict, size_t n, const wchar_t * __restrict,
                                      ^
2 errors generated.

---

json_ui.cc:60:13: error: use of undeclared identifier 'sprintf'; did you mean 'swprintf'?
            sprintf(buf, "\\u%04x", *next);
            ^~~~~~~
            swprintf
/usr/include/wchar.h:133:5: note: 'swprintf' declared here
int     swprintf(wchar_t * __restrict, size_t n, const wchar_t * __restrict,
        ^
json_ui.cc:60:21: error: cannot initialize a parameter of type 'wchar_t *' with an lvalue of type 'char [7]'
            sprintf(buf, "\\u%04x", *next);
                    ^~~
/usr/include/wchar.h:133:34: note: passing argument to parameter here
int     swprintf(wchar_t * __restrict, size_t n, const wchar_t * __restrict,
                                     ^
json_ui.cc:74:9: error: use of undeclared identifier 'sprintf'
        sprintf(buffer, R"("#%02x%02x%02x")", color.r, color.g, color.b);
        ^
3 errors generated.

---

regex_impl.cc:1039:9: error: use of undeclared identifier 'sprintf'; did you mean 'swprintf'?
        sprintf(buf, " %03d     ", count++);
        ^~~~~~~
        swprintf
/usr/include/wchar.h:133:5: note: 'swprintf' declared here
int     swprintf(wchar_t * __restrict, size_t n, const wchar_t * __restrict,
        ^
regex_impl.cc:1039:17: error: cannot initialize a parameter of type 'wchar_t *' with an lvalue of type 'char [20]'
        sprintf(buf, " %03d     ", count++);
                ^~~
/usr/include/wchar.h:133:34: note: passing argument to parameter here
int     swprintf(wchar_t * __restrict, size_t n, const wchar_t * __restrict,
                                     ^
regex_impl.cc:1197:17: error: use of undeclared identifier 'puts'
    { if (dump) puts(dump_regex(*this).c_str()); }
                ^
regex_impl.cc:1208:18: note: in instantiation of member function 'Kakoune::(anonymous namespace)::TestVM<Kakoune::RegexMode::Forward>::TestVM' requested here
        TestVM<> vm{R"(a*b)"};
                 ^
regex_impl.cc:1197:17: error: use of undeclared identifier 'puts'
    { if (dump) puts(dump_regex(*this).c_str()); }
                ^
regex_impl.cc:1283:56: note: in instantiation of member function 'Kakoune::(anonymous namespace)::TestVM<5>::TestVM' requested here
        TestVM<RegexMode::Forward | RegexMode::Search> vm{R"(f.*a(.*o))"};
                                                       ^
regex_impl.cc:1197:17: error: use of undeclared identifier 'puts'
    { if (dump) puts(dump_regex(*this).c_str()); }
                ^
regex_impl.cc:1423:57: note: in instantiation of member function 'Kakoune::(anonymous namespace)::TestVM<6>::TestVM' requested here
        TestVM<RegexMode::Backward | RegexMode::Search> vm{R"(fo{1,})"};
                                                        ^
5 errors generated.

---

remote.cc:829:9: error: use of undeclared identifier 'rename'; did you mean 'devname'?
    if (rename(old_socket_file.c_str(), new_socket_file.c_str()) != 0)
        ^~~~~~
        devname
/usr/include/stdlib.h:277:7: note: 'devname' declared here
char    *devname(__dev_t, __mode_t);
         ^
remote.cc:829:16: error: cannot initialize a parameter of type '__dev_t' (aka 'unsigned long') with an rvalue of type 'const char *'
    if (rename(old_socket_file.c_str(), new_socket_file.c_str()) != 0)
               ^~~~~~~~~~~~~~~~~~~~~~~
/usr/include/stdlib.h:277:22: note: passing argument to parameter here
char    *devname(__dev_t, __mode_t);
                        ^
2 errors generated.

---

string_utils.cc:126:20: error: use of undeclared identifier 'sprintf'; did you mean 'swprintf'?
    res.m_length = sprintf(res.m_data, "%i", val);
                   ^~~~~~~
                   swprintf
/usr/include/wchar.h:133:5: note: 'swprintf' declared here
int     swprintf(wchar_t * __restrict, size_t n, const wchar_t * __restrict,
        ^
string_utils.cc:126:28: error: cannot initialize a parameter of type 'wchar_t *' with an lvalue of type 'char [15]'
    res.m_length = sprintf(res.m_data, "%i", val);
                           ^~~~~~~~~~
/usr/include/wchar.h:133:34: note: passing argument to parameter here
int     swprintf(wchar_t * __restrict, size_t n, const wchar_t * __restrict,
                                     ^
string_utils.cc:133:20: error: use of undeclared identifier 'sprintf'; did you mean 'swprintf'?
    res.m_length = sprintf(res.m_data, "%u", val);
                   ^~~~~~~
                   swprintf
[...]
2019-07-06 08:53:47 +02:00
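All of the errors above concern functions declared in the C standard I/O header (`rename`, `snprintf`, `sprintf`, `puts`). The log is truncated and does not show the fix itself, so the following is only an assumed illustration of the usual remedy: include `<cstdio>` directly instead of relying on another header to pull it in. `format_line_number` is a hypothetical helper.

```cpp
#include <cstdio>  // declares std::snprintf, std::sprintf, std::puts, std::rename
#include <cstdlib> // declares std::abs for integers
#include <string>

// Hypothetical helper: format the absolute value of a line number,
// right-aligned in a field of at least three characters.
inline std::string format_line_number(int line)
{
    char buffer[16];
    std::snprintf(buffer, sizeof(buffer), "%3d", std::abs(line));
    return buffer;
}
```

Which standard headers include each other is implementation-defined, so code that builds with one standard library can fail with another unless each header it needs is included explicitly.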
Justin Frank
8178400f8d Fixed all reorder warnings 2019-02-27 22:45:31 -08:00
Maxime Coste
5c0175d90a Remove peephole regex optimization pass
The current implementation is wrong as it crosses basic block
boundaries. Doing basic block decomposition of regex is probably
a tad too complex for this single optimization.

Fixes #2711
2019-02-04 22:10:19 +11:00
Maxime Coste
d9d2140ea2 Fix regex not always selecting the leftmost longest match
(Actually the rightmost longest match when searching backwards)

Fixes #2710
2019-02-04 17:33:29 +11:00
Maxime Coste
77b1216ace Add a peephole optimization pass to the regex compiler 2019-01-20 22:59:28 +11:00
Maxime Coste
0364a99827 Refactor regex find next start not to be an instruction anymore
The same logic can be hard coded, avoiding one thread and 3
instructions, improving the regex matching speed.
2019-01-20 22:59:28 +11:00
Maxime Coste
fd043435e5 Split compile time regex flags from runtime ones 2019-01-20 22:59:28 +11:00
Maxime Coste
2afc147b2c Refactor parsed regex children iteration to use regular range-for loops 2019-01-20 22:59:28 +11:00
Maxime Coste
328c497be2 Add support for named captures to the regex impl and regex highlighter
ECMAScript is adding support for it, and it is a pretty isolated
change to do.

Fixes #2293
2019-01-03 22:55:50 +11:00
Maxime Coste
b4571bd172 Dump start description as well when writing a regex dump 2018-11-04 12:01:29 +11:00
Maxime Coste
4ac7df3842 Remove most regex impl special casing for backwards matching 2018-11-03 13:52:40 +11:00
Maxime Coste
4cfb46ff2e Support different type for iterators and sentinel in utf8 functions 2018-11-01 08:22:43 +11:00
Maxime Coste
d652ec9ce1 Cleanup regex lookarounds implementation and reject incompatible regex
Fixes #2487
2018-10-10 22:47:59 +11:00
Maxime Coste
cde0c51cd6 Tweak comment to make it less ambiguous 2018-07-08 16:58:19 +10:00
Olivier Perret
67655de947 Use a dedicated vm op for dot when match-newline is false 2018-06-24 12:41:50 +02:00
Olivier Perret
b5ee1db1c4 Use bit-flags for storing regex options 2018-06-24 12:41:50 +02:00
Olivier Perret
8edef8b3f1 Add support for regex flag to toggle dot-matches-newline 2018-06-24 12:41:50 +02:00
Maxime Coste
1fb53ca712 Fix wrong use of constexpr 2018-04-30 07:41:31 +10:00
Maxime Coste
1e8026f143 Regex: Use only 128 characters in start desc and encode others as 0
Using 257 was using lots of memory for no good reason, as codepoints
> 127 are not common enough to be treated specially.
2018-04-29 19:58:18 +10:00
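A minimal sketch of the encoding this commit describes, using a hypothetical `ExampleStartDesc` type (the member names are assumptions): code points below 128 get their own slot, and every other code point shares slot 0.

```cpp
#include <bitset>
#include <cstddef>

// Hypothetical start description: which code points may begin a match.
struct ExampleStartDesc
{
    static constexpr std::size_t count = 128; // one slot per ASCII code point
    static constexpr std::size_t other = 0;   // shared slot for everything else

    std::bitset<count> map;

    // Map a code point to its slot; anything >= 128 collapses to `other`.
    static std::size_t slot(char32_t cp)
    {
        return cp < count ? static_cast<std::size_t>(cp) : other;
    }

    void add(char32_t cp) { map.set(slot(cp)); }
    bool can_start_with(char32_t cp) const { return map.test(slot(cp)); }
};
```

The trade-off in this sketch is precision: once any code point above 127 is added, all of them (and NUL, which shares slot 0) are reported as possible starts, so the check can only skip positions conservatively.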
Maxime Coste
a1b8864c77 Merge remote-tracking branch 'lenormf/regex-format-string' into HEAD 2018-04-28 09:29:57 +10:00
Maxime Coste
2b9ec411d3 fix potential overflow in dump_regex 2018-04-28 09:29:15 +10:00
Frank LENORMAND
9bac04d35f regex_impl: Fix a potential format string flaw 2018-04-27 09:24:22 +03:00
Maxime Coste
8438b33175 Add a debug regex command to dump regex instructions 2018-04-27 08:35:09 +10:00
Maxime Coste
f10eb9faa3 Use indices instead of pointers for saves/instruction in ThreadedRegexVM
Performance seems unaffected, but memory usage should be lowered
as the Thread struct is 4 bytes instead of 16.
2018-04-27 08:35:09 +10:00
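The size difference can be sketched as below. Both structs are hypothetical stand-ins (the field names and the 16-bit widths are assumptions chosen to match the "4 bytes instead of 16" figure on a 64-bit target), not the actual ThreadedRegexVM types.

```cpp
#include <cstdint>

// Pointer-based thread: two pointers, 16 bytes on a typical 64-bit target.
struct PointerThread
{
    const void* inst; // points at the current instruction
    void* saves;      // points at the capture storage
};

// Index-based thread: two 16-bit indices into arrays owned by the VM.
struct IndexThread
{
    std::uint16_t inst;  // index into the instruction array
    std::int16_t  saves; // index into the saves array, -1 for "none"
};

static_assert(sizeof(IndexThread) == 4, "two 16-bit indices fit in 4 bytes");
```

Indices limit how many instructions and save slots can be addressed (65536 instructions with the widths assumed here), which is the price paid for the smaller struct.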
Maxime Coste
71a1893a5e Fix some trailing spaces and a tab that sneaked into the code base 2018-04-05 08:52:33 +10:00
Maxime Coste
b27d4afa8d Regex: Only allow SyntaxCharacter and - to be escaped in a character class
Letting any character be escaped is error-prone, as it looks like
\l could mean [:lower:] (as it used to with boost) when it only means
literal l.

Fix the haskell.kak file as well.

Fixes #1945
2018-03-20 04:57:47 +11:00
Maxime Coste
fb65fa60f8 Regex: take the full subject range as a parameter
To allow more general look arounds out of the actual search range,
pass a second range (the actual subject). This allows us to remove
various flags such as PrevAvailable or NotBeginOfSubject, which are
now easy to check from the subject range.

Fixes #1902
2018-03-05 05:48:10 +11:00
Maxime Coste
933ac4d3d5 Regex: Improve comments and constify some variables
Reword various comments to make some tricky parts of the regex
engine easier to understand.
2018-02-24 17:40:08 +11:00
Maxime Coste
3584e00d19 Regex: Use a template argument instead of a regular one for "forward"
forward (which controls if we are compiling for forward or backward
matching) is always statically known, and compilation will first
compile forward, then backward (if needed), so by having separate
compiled functions we get rid of runtime branches.
2018-02-09 22:45:53 +11:00
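A minimal sketch of turning a runtime `forward` argument into a template argument, assuming C++17 for `if constexpr`. `find_char` is a hypothetical function used only to show the pattern; it is unrelated to the actual compiler code.

```cpp
#include <cstddef>
#include <string_view>

// The direction is a template parameter, so each instantiation contains
// only the code for its own direction and no runtime test on `forward`.
template<bool forward>
std::size_t find_char(std::string_view text, char c)
{
    if constexpr (forward)
        return text.find(c);  // first occurrence, scanning left to right
    else
        return text.rfind(c); // last occurrence, scanning right to left
}
```

A caller that knows the direction statically writes `find_char<true>(text, c)` or `find_char<false>(text, c)`; the compiler generates two separate functions.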
Maxime Coste
aa9f7753e8 Regex: minor code cleanup 2018-02-09 22:19:56 +11:00
Maxime Coste
413f880e9e Regex: Support forward and backward matching code in the same CompiledRegex
No need to have two separate regexes to handle forward and backward
matching, just passing RegexCompileFlags::Backward will add support
for backward matching to the regex. For backward only regex, pass
RegexCompileFlags::NoForward as well to disable generation of
forward matching code.
2017-12-01 19:57:02 +08:00
Maxime Coste
7bfb695c45 Regex: Do not allow private use codepoint literals
We use them to encode non-literals in lookarounds, so they can
trigger bugs.

Fixes #1737
2017-12-01 16:37:18 +08:00
Maxime Coste
65b057f261 Regex: rename StartChars to StartDesc
It only contains chars for now, but it's still more generally
describing where matches can start.
2017-12-01 14:46:18 +08:00
Maxime Coste
b91f43b031 Regex: optimize parsing a bit 2017-11-30 14:32:29 +08:00
Maxime Coste
c1f0efa3f4 Regex: smarter handling of start chars computation for character class 2017-11-30 14:19:41 +08:00
Maxime Coste
ae0911b533 Regex: Various small code tweaks 2017-11-28 01:03:54 +08:00
Maxime Coste
4598832ed5 Regex: optimize compilation by reserving data 2017-11-28 00:59:57 +08:00
Maxime Coste
a52da6fe34 Regex: Tweak is_ctype implementation style 2017-11-28 00:13:42 +08:00
Maxime Coste
8b40f57145 Regex: Replace generic 'Matchers' with specialized functionality
Introduce CharacterClass and CharacterType Regex Op, and optimize
their evaluation.
2017-11-25 18:14:15 +08:00
Maxime Coste
0d44cf9591 Regex: do not decode utf8 in accept calls as they always run on ascii 2017-11-25 18:13:27 +08:00