home/doc/pages/regex.asciidoc
Kylie McClain 4ae2102cd8 regex.asciidoc: rephrasing, style, consistency
* Polish some grammar in places.
* Correct some capitalization nitpicks.
* Use "newline" rather than "line feed", which tends to be more common
  in Kakoune's documentation thusfar.

I rephrased some sections, as some of them read a little odd.
* Zero width assertions
    * Consistently use "subject's beginning" instead of "subject begin",
      it reads better.
    * Improve the flow of the word boundary descriptions.
* Modifiers
    * Improve phrasing to emphasize the linear nature of their usage and
      remove a double negative.
    * Use `.` instead of "dot", since that aids in searching through the
      page for things talking about the dot character.
* Compatibility
    * Use asciidoc syntax for the link to the ECMA-262 standard.
    * Use better punctuation on the point about escapes.
2021-05-16 09:16:14 -04:00

193 lines
7.4 KiB
Plaintext

= Regex
== Regex syntax
Kakoune regex syntax is based on ECMAScript syntax, as defined by the
ECMA-262 standard (see <<regex#compatibility,:doc regex compatibility>>).
Kakoune's regex always runs on Unicode codepoint sequences, not on bytes.
== Literals
Every character except the syntax characters `\^$.*+?[]{}|().` match
themselves. Syntax characters can be escaped with a backslash so that
`\$` will match a literal `$`, and `\\` will match a literal `\`.
Some literals are available as escape sequences:
* `\f` matches the form feed character.
* `\n` matches the newline character.
* `\r` matches the carriage return character.
* `\t` matches the tabulation character.
* `\v` matches the vertical tabulation character.
* `\0` matches the null character.
* `\cX` matches the control-`X` character (`X` can be in `[A-Za-z]`).
* `\xXX` matches the character whose codepoint is `XX` (in hexadecimal).
* `\uXXXXXX` matches the character whose codepoint is `XXXXXX` (in hexadecimal).
== Character classes
The `[` character introduces a character class, matching one character
from a set of characters.
A character class contains a list of literals, character ranges,
and character class escapes surrounded by `[` and `]`.
If the first character inside a character class is `^`, then the character
class is negated, meaning that it matches every character not specified
in the character class.
Literals match themselves, including syntax characters, so `^`
does not need to be escaped in a character class. `[\*+]` matches both
the `\*` character and the `+` character. Literal escape sequences are
supported, so `[\n\r]` matches both the newline and carriage return
characters.
The `]` character needs to be escaped for it to match a literal `]`
instead of closing the character class.
Character ranges are written as `<start character>-<end character>`, so
`[A-Z]` matches all uppercase basic letters. `[A-Z0-9]` will match all
uppercase basic letters and all basic digits.
The `-` characters in a character class that are not specifying a
range are treated as literal `-`, so `[A-Z-+]` matches all upper case
characters, the `-` character, and the `+` character.
Supported character class escapes are:
* `\d` which matches all digits.
* `\w` which matches all word characters.
* `\s` which matches all whitespace characters.
* `\h` which matches all horizontal whitespace characters.
Using an upper case letter instead of a lower case one will negate
the character class. For example, `\D` will match every non-digit
character.
Character class escapes can be used outside of a character class, `\d`
is equivalent to `[\d]`.
== Any character
`.` matches any character, including newlines, by default.
(see <<regex#modifiers,:doc regex modifiers>> on how to change it)
== Groups
Regex atoms can be grouped using `(` and `)` or `(?:` and `)`. If `(` is
used, the group will be a capturing group, which means the positions from
the subject strings that matched between `(` and `)` will be recorded.
Capture groups are numbered starting at 1. They are numbered in the
order of appearance of their `(` in the regex. A special capture group
0 is for the whole sequence that matched.
* `(?:` introduces a non capturing group, which will not record the
matching positions.
* `(?<name>` introduces a named capturing group, which, in addition to
being referred by number, can be, in certain contexts, referred by the
given name.
== Alternations
The `|` character introduces an alternation, which will either match
its left-hand side, or its right-hand side (preferring the left-hand side)
For example, `foo|bar` matches either `foo` or `bar`, `foo(bar|baz|qux)`
matches `foo` followed by either `bar`, `baz` or `qux`.
== Quantifier
Literals, character classes, any characters, and groups can be followed
by a quantifier, which specifies the number of times they can match.
* `?` matches zero, or one time.
* `*` matches zero or more times.
* `+` matches one or more times.
* `{n}` matches exactly `n` times.
* `{n,}` matches `n` or more times.
* `{n,m}` matches `n` to `m` times.
* `{,m}` matches zero to `m` times.
By default, quantifiers are *greedy*, which means they will prefer to
match more characters if possible. Suffixing a quantifier with `?` will
make it non-greedy, meaning it will prefer to match as few characters
as possible.
== Zero width assertions
Assertions do not consume any character, but they will prevent the regex
from matching if not fulfilled.
* `^` matches at the start of a line; that is, just after a newline
character, or at the subject's beginning (unless it is specified
that the subject's beginning is not a start of line).
* `$` matches at the end of a line; that is, just before a newline, or
at the subject end (unless it is specified that the subject's end
is not an end of line).
* `\b` matches at a word boundary; which is to say that between the
previous character and the current character, one is a word
character, and the other is not.
* `\B` matches at a non-word boundary; meaning, when both the previous
character and the current character are word characters, or both
are not.
* `\A` matches at the subject string's beginning.
* `\z` matches at the subject string's end.
* `\K` matches anything, and resets the start position of capture group
0 to the current position.
More complex assertions can be expressed with lookarounds:
* `(?=...)` is a lookahead; it will match if its content matches the
text following the current position.
* `(?!...)` is a negative lookahead; it will match if its content does
not match the text following the current position.
* `(?<=...)` is a lookbehind; it will match if its content matches
the text preceding the current position.
* `(?<!...)` is a negative lookbehind; it will match if its content does
not match the text preceding the current position.
For performance reasons, lookaround contents must be a sequence of
literals, character classes, or any character (`.`); quantifiers are not
supported.
For example, `(?<!bar)(?=foo).` will match any character which is not
preceded by `bar` and where `foo` matches from the current position
(which means the character has to be an `f`).
== Modifiers
Some modifiers can control the matching behavior of the atoms following
them:
* `(?i)` starts case-insensitive matching.
* `(?I)` starts case-sensitive matching (default).
* `(?s)` allows `.` to match newlines (default).
* `(?S)` prevents `.` from matching newlines.
== Quoting
`\Q` will start a quoted sequence, where every character is treated as
a literal. That quoted sequence will continue until either the end of
the regex, or the appearance of `\E`.
For example, `.\Q.^$\E$` will match any character followed by the
literal string `.^$`, followed by an end of line.
== Compatibility
Kakoune's syntax tries to follow the ECMAScript regex syntax, as defined
by <https://www.ecma-international.org/ecma-262/8.0/>; some divergence
exists for ease of use, or performance reasons:
* Lookarounds are not arbitrary, but lookbehind is supported.
* `\K`, `\Q..\E`, `\A`, `\h` and `\z` are added.
* Stricter handling of escaping, as we introduce additional escapes;
identity escapes like `\X` with `X` being a non-special character
are not accepted, to avoid confusions between `\h` meaning literal
`h` in ECMAScript, and horizontal blank in Kakoune.
* `\uXXXXXX` uses 6 digits to cover all of Unicode, instead of relying
on ECMAScript UTF-16 surrogate pairs with 4 digits.