= Regex == Regex syntax Kakoune's regex syntax is inspired after ECMAScript, as defined by the ECMA-262 standard (see <>). Kakoune's regex always runs on Unicode codepoint sequences, not on bytes. == Literals Every character except the syntax characters `\^$.*+?[]{}|().` match themselves. Syntax characters can be escaped with a backslash so that `\$` will match a literal `$`, and `\\` will match a literal `\`. Some literals are available as escape sequences: * `\f` matches the form feed character. * `\n` matches the newline character. * `\r` matches the carriage return character. * `\t` matches the tabulation character. * `\v` matches the vertical tabulation character. * `\0` matches the null character. * `\cX` matches the control-`X` character (`X` can be in `[A-Za-z]`). * `\xXX` matches the character whose codepoint is `XX` (in hexadecimal). * `\uXXXXXX` matches the character whose codepoint is `XXXXXX` (in hexadecimal). == Character classes The `[` character introduces a character class, matching one character from a set of characters. A character class contains a list of literals, character ranges, and character class escapes surrounded by `[` and `]`. If the first character inside a character class is `^`, then the character class is negated, meaning that it matches every character not specified in the character class. Literals match themselves, including syntax characters, so `^` does not need to be escaped in a character class. `[\*+]` matches both the `\*` character and the `+` character. Literal escape sequences are supported, so `[\n\r]` matches both the newline and carriage return characters. The `]` character needs to be escaped for it to match a literal `]` instead of closing the character class. Character ranges are written as `-`, so `[A-Z]` matches all uppercase basic letters. `[A-Z0-9]` will match all uppercase basic letters and all basic digits. The `-` characters in a character class that are not specifying a range are treated as literal `-`, so `[A-Z-+]` matches all upper case characters, the `-` character, and the `+` character. Supported character class escapes are: * `\d` which matches digits 0-9. * `\w` which matches word characters A-Z, a-z, 0-9 and underscore (ignoring the `extra_word_chars` option). * `\s` which matches all Unicode whitespace characters. * `\h` which matches whitespace except Vertical Tab and line-breaks. Using an upper case letter instead of a lower case one will negate the character class. For example, `\D` will match every non-digit character. Character class escapes can be used outside of a character class, `\d` is equivalent to `[\d]`. == Any character `.` matches any character, including newlines, by default. (see <> on how to change it) == Groups Regex atoms can be grouped using `(` and `)` or `(?:` and `)`. If `(` is used, the group will be a capturing group, which means the positions from the subject strings that matched between `(` and `)` will be recorded. Capture groups are numbered starting at 1. They are numbered in the order of appearance of their `(` in the regex. A special capture group 0 is for the whole sequence that matched. * `(?:` introduces a non capturing group, which will not record the matching positions. * `(?` introduces a named capturing group, which, in addition to being referred by number, can be, in certain contexts, referred by the given name. == Alternations The `|` character introduces an alternation, which will either match its left-hand side, or its right-hand side (preferring the left-hand side) For example, `foo|bar` matches either `foo` or `bar`, `foo(bar|baz|qux)` matches `foo` followed by either `bar`, `baz` or `qux`. == Quantifier Literals, character classes, any characters, and groups can be followed by a quantifier, which specifies the number of times they can match. * `?` matches zero, or one time. * `*` matches zero or more times. * `+` matches one or more times. * `{n}` matches exactly `n` times. * `{n,}` matches `n` or more times. * `{n,m}` matches `n` to `m` times. * `{,m}` matches zero to `m` times. By default, quantifiers are *greedy*, which means they will prefer to match more characters if possible. Suffixing a quantifier with `?` will make it non-greedy, meaning it will prefer to match as few characters as possible. == Zero width assertions Assertions do not consume any character, but they will prevent the regex from matching if not fulfilled. * `^` matches at the start of a line; that is, just after a newline character, or at the subject's beginning (unless it is specified that the subject's beginning is not a start of line). * `$` matches at the end of a line; that is, just before a newline, or at the subject end (unless it is specified that the subject's end is not an end of line). * `\b` matches at a word boundary; which is to say that between the previous character and the current character, one is a word character, and the other is not. * `\B` matches at a non-word boundary; meaning, when both the previous character and the current character are word characters, or both are not. * `\A` matches at the subject string's beginning. * `\z` matches at the subject string's end. * `\K` matches anything, and resets the start position of capture group 0 to the current position. More complex assertions can be expressed with lookarounds: * `(?=...)` is a lookahead; it will match if its content matches the text following the current position. * `(?!...)` is a negative lookahead; it will match if its content does not match the text following the current position. * `(?<=...)` is a lookbehind; it will match if its content matches the text preceding the current position. * `(?; some divergence exists for ease of use, or performance reasons: * Lookarounds are not arbitrary, but lookbehind is supported. * `\K`, `\Q..\E`, `\A`, `\h` and `\z` are added. * Stricter handling of escaping, as we introduce additional escapes; identity escapes like `\X` with `X` being a non-special character are not accepted, to avoid confusions between `\h` meaning literal `h` in ECMAScript, and horizontal blank in Kakoune. * `\uXXXXXX` uses 6 digits to cover all of Unicode, instead of relying on ECMAScript UTF-16 surrogate pairs with 4 digits.