regex.asciidoc: rephrasing, style, consistency

* Polish some grammar in places.
* Correct some capitalization nitpicks.
* Use "newline" rather than "line feed", which tends to be more common
  in Kakoune's documentation thus far.

I rephrased some sections, as some of them read a little odd.
* Zero width assertions
    * Consistently use "subject's beginning" instead of "subject begin",
      it reads better.
    * Improve the flow of the word boundary descriptions.
* Modifiers
    * Improve phrasing to emphasize the linear nature of their usage and
      remove a double negative.
    * Use `.` instead of "dot", since that aids in searching through the
      page for things talking about the dot character.
* Compatibility
    * Use asciidoc syntax for the link to the ECMA-262 standard.
    * Use better punctuation on the point about escapes.
Kylie McClain 2021-05-10 02:40:01 -04:00
parent ead12e11bd
commit 4ae2102cd8


@ -1,29 +1,29 @@
= Regex = Regex
== Regex Syntax == Regex syntax
Kakoune regex syntax is based on the ECMAScript syntax, as defined by the Kakoune regex syntax is based on ECMAScript syntax, as defined by the
ECMA-262 standard (see <<Compatibility>>). ECMA-262 standard (see <<regex#compatibility,:doc regex compatibility>>).
Kakoune's regex always run on Unicode codepoint sequences, not on bytes. Kakoune's regex always runs on Unicode codepoint sequences, not on bytes.
== Literals == Literals
Every character except the syntax characters `\^$.*+?[]{}|().` match Every character except the syntax characters `\^$.*+?[]{}|().` match
themselves. Syntax characters can be escaped with a backslash so `\$` themselves. Syntax characters can be escaped with a backslash so that
will match a literal `$` and `\\` will match a literal `\`. `\$` will match a literal `$`, and `\\` will match a literal `\`.
Some literals are available as escape sequences: Some literals are available as escape sequences:
* `\f` matches the form feed character. * `\f` matches the form feed character.
* `\n` matches the line feed character. * `\n` matches the newline character.
* `\r` matches the carriage return character. * `\r` matches the carriage return character.
* `\t` matches the tabulation character. * `\t` matches the tabulation character.
* `\v` matches the vertical tabulation character. * `\v` matches the vertical tabulation character.
* `\0` matches the null character. * `\0` matches the null character.
* `\cX` matches the control-X character (X can be in `[A-Za-z]`). * `\cX` matches the control-`X` character (`X` can be in `[A-Za-z]`).
* `\xXX` matches the character whose codepoint is XX (in hexadecimal). * `\xXX` matches the character whose codepoint is `XX` (in hexadecimal).
* `\uXXXXXX` matches the character whose codepoint is XXXXXX (in hexadecimal). * `\uXXXXXX` matches the character whose codepoint is `XXXXXX` (in hexadecimal).
== Character classes == Character classes
@ -40,15 +40,15 @@ in the character class.
Literals match themselves, including syntax characters, so `^` Literals match themselves, including syntax characters, so `^`
does not need to be escaped in a character class. `[\*+]` matches both does not need to be escaped in a character class. `[\*+]` matches both
the `\*` character and the `+` character. Literal escape sequences are the `\*` character and the `+` character. Literal escape sequences are
supported, so `[\n\r]` matches both the line feed and carriage return supported, so `[\n\r]` matches both the newline and carriage return
characters. characters.
The `]` character needs to be escaped for it to match a literal `]` The `]` character needs to be escaped for it to match a literal `]`
instead of closing the character class. instead of closing the character class.
Character ranges are written as `<start character>-<end character>`, so Character ranges are written as `<start character>-<end character>`, so
`[A-Z]` matches all upper case basic letters. `[A-Z0-9]` will match all `[A-Z]` matches all uppercase basic letters. `[A-Z0-9]` will match all
upper cases basic letters and all basic digits. uppercase basic letters and all basic digits.
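As a rough illustration of the character class rules in this hunk, here is a small sketch using Python's `re` module rather than Kakoune's own engine. The two are assumed to agree on these simple classes; the helper name `matched` is made up for the example.

```python
import re

# Inside a class, syntax characters such as * and + are plain literals.
star_or_plus = re.compile(r"[\*+]")
# Several ranges can share one class.
upper_or_digit = re.compile(r"[A-Z0-9]")


def matched(pattern, text):
    """Return every single-character match of pattern in text."""
    return pattern.findall(text)


print(matched(star_or_plus, "a*b+c"))      # ['*', '+']
print(matched(upper_or_digit, "aB3-x9Z"))  # ['B', '3', '9', 'Z']
```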
The `-` characters in a character class that are not specifying a The `-` characters in a character class that are not specifying a
range are treated as literal `-`, so `[A-Z-+]` matches all upper case range are treated as literal `-`, so `[A-Z-+]` matches all upper case
@ -62,15 +62,16 @@ Supported character class escapes are:
* `\h` which matches all horizontal whitespace characters. * `\h` which matches all horizontal whitespace characters.
Using an upper case letter instead of a lower case one will negate Using an upper case letter instead of a lower case one will negate
the character class, meaning for example that `\D` will match every the character class. For example, `\D` will match every non-digit
non-digit character. character.
Character class escapes can be used outside of a character class, `\d` Character class escapes can be used outside of a character class, `\d`
is equivalent to `[\d]`. is equivalent to `[\d]`.
== Any character == Any character
`.` matches any character, including new lines. `.` matches any character, including newlines, by default.
(see <<regex#modifiers,:doc regex modifiers>> on how to change it)
== Groups == Groups
@ -99,16 +100,16 @@ matches `foo` followed by either `bar`, `baz` or `qux`.
== Quantifier == Quantifier
Literals, Character classes, Any characters and groups can be followed Literals, character classes, any characters, and groups can be followed
by a quantifier, which specifies the number of times they can match. by a quantifier, which specifies the number of times they can match.
* `?` matches zero or one times. * `?` matches zero, or one time.
* `*` matches zero or more times. * `*` matches zero or more times.
* `+` matches one or more times. * `+` matches one or more times.
* `{n}` matches exactly n times. * `{n}` matches exactly `n` times.
* `{n,}` matches n or more times. * `{n,}` matches `n` or more times.
* `{n,m}` matches n to m times. * `{n,m}` matches `n` to `m` times.
* `{,m}` matches zero to m times. * `{,m}` matches zero to `m` times.
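The counts in the quantifier list above can be checked with a short sketch. It runs on Python's `re` module, which is assumed to treat these quantifiers the same way as Kakoune; `full` is a hypothetical helper, not part of either tool.

```python
import re


def full(pattern, text):
    """True when pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None


print(full(r"ab?", "a"), full(r"ab?", "abb"))           # True False
print(full(r"a{2}", "aa"), full(r"a{2}", "aaa"))        # True False
print(full(r"a{2,}", "aaaa"), full(r"a{2,}", "a"))      # True False
print(full(r"a{1,3}", "aaa"), full(r"a{1,3}", "aaaa"))  # True False
print(full(r"a{,2}", ""), full(r"a{,2}", "aaa"))        # True False
```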
By default, quantifiers are *greedy*, which means they will prefer to By default, quantifiers are *greedy*, which means they will prefer to
match more characters if possible. Suffixing a quantifier with `?` will match more characters if possible. Suffixing a quantifier with `?` will
@ -117,37 +118,40 @@ as possible.
== Zero width assertions == Zero width assertions
Assertions do not consume any character, but will prevent the regex Assertions do not consume any character, but they will prevent the regex
from matching if they are not fulfilled. from matching if not fulfilled.
* `^` matches at the start of a line, that is just after a new line * `^` matches at the start of a line; that is, just after a newline
character, or at the subject begin (except if specified that the character, or at the subject's beginning (unless it is specified
subject begin is not a start of line). that the subject's beginning is not a start of line).
* `$` matches at the end of a line, that is just before a new line, or * `$` matches at the end of a line; that is, just before a newline, or
at the subject end (except if specified that the subject's end at the subject end (unless it is specified that the subject's end
is not an end of line). is not an end of line).
* `\b` matches at a word boundary, when one of the previous character * `\b` matches at a word boundary; which is to say that between the
and current character is a word character, and the other is not. previous character and the current character, one is a word
* `\B` matches at a non word boundary, when both the previous character character, and the other is not.
and the current character are word, or are not. * `\B` matches at a non-word boundary; meaning, when both the previous
* `\A` matches at the subject string begin. character and the current character are word characters, or both
* `\z` matches at the subject string end. are not.
* `\K` matches anything, and resets the start position of the capture * `\A` matches at the subject string's beginning.
group 0 to the current position. * `\z` matches at the subject string's end.
* `\K` matches anything, and resets the start position of capture group
0 to the current position.
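To make the boundary descriptions above concrete, the following sketch lists the positions where each assertion holds. It uses Python's `re` module as a stand-in, with two caveats: Python needs the `(?m)` flag for `^` and `$` to match at line boundaries, and it has no `\K`, so that assertion is not shown.

```python
import re

text = "foo bar"

# \b: positions where exactly one neighbour is a word character.
boundaries = [m.start() for m in re.finditer(r"\b", text)]
# \B: positions where both neighbours are word characters, or neither is.
non_boundaries = [m.start() for m in re.finditer(r"\B", text)]

print(boundaries)      # [0, 3, 4, 7]
print(non_boundaries)  # [1, 2, 5, 6]

lines = "ab\ncd"
# With (?m), ^ holds at the subject start and just after each newline.
line_starts = [m.start() for m in re.finditer(r"(?m)^", lines)]
# With (?m), $ holds just before each newline and at the subject end.
line_ends = [m.start() for m in re.finditer(r"(?m)$", lines)]

print(line_starts)  # [0, 3]
print(line_ends)    # [2, 5]
```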
More complex assertions can be expressed with lookarounds: More complex assertions can be expressed with lookarounds:
* `(?=...)` is a lookahead, it will match if its content matches the text * `(?=...)` is a lookahead; it will match if its content matches the
following the current position text following the current position.
* `(?!...)` is a negative lookahead, it will match if its content does * `(?!...)` is a negative lookahead; it will match if its content does
not match the text following the current position not match the text following the current position.
* `(?<=...)` is a lookbehind, it will match if its content matches * `(?<=...)` is a lookbehind; it will match if its content matches
the text preceding the current position the text preceding the current position.
* `(?<!...)` is a negative lookbehind, it will match if its content does * `(?<!...)` is a negative lookbehind; it will match if its content does
not match the text preceding the current position not match the text preceding the current position.
For performance reasons lookaround contents must be sequence of literals, For performance reasons, lookaround contents must be a sequence of
character classes or any-character (`.`); Quantifiers are not supported. literals, character classes, or any character (`.`); quantifiers are not
supported.
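The four lookaround forms can be tried out with a short sketch. It uses Python's `re` module, which is assumed to behave like Kakoune for lookarounds whose contents are plain literal sequences; `starts` is a hypothetical helper.

```python
import re


def starts(pattern, text):
    """Start offsets of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]


text = "foobar barfoo"

print(starts(r"foo(?=bar)", text))   # [0]   foo followed by bar
print(starts(r"foo(?!bar)", text))   # [10]  foo not followed by bar
print(starts(r"(?<=bar)foo", text))  # [10]  foo preceded by bar
print(starts(r"(?<!bar)foo", text))  # [0]   foo not preceded by bar
```

None of the lookarounds consume text: every match is exactly the three characters `foo`.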
For example, `(?<!bar)(?=foo).` will match any character which is not For example, `(?<!bar)(?=foo).` will match any character which is not
preceded by `bar` and where `foo` matches from the current position preceded by `bar` and where `foo` matches from the current position
@ -158,10 +162,10 @@ preceded by `bar` and where `foo` matches from the current position
Some modifiers can control the matching behavior of the atoms following Some modifiers can control the matching behavior of the atoms following
them: them:
* `(?i)` enables case-insensitive matching * `(?i)` starts case-insensitive matching.
* `(?I)` disables case-insensitive matching (default) * `(?I)` starts case-sensitive matching (default).
* `(?s)` enables dot-matches-newline (default) * `(?s)` allows `.` to match newlines (default).
* `(?S)` disables dot-matches-newline * `(?S)` prevents `.` from matching newlines.
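The sketch below approximates these modifiers in Python's `re` module, which differs from Kakoune in two ways that matter here. First, Python's `.` excludes newlines unless `(?s)` is given, the reverse of the default listed above. Second, Python has no `(?I)` or `(?S)`, so the scoped group `(?i:...)` stands in for switching case-insensitivity on and back off partway through a pattern. Treating `foo(?i:bar)baz` as the counterpart of a Kakoune pattern like `foo(?i)bar(?I)baz` is an assumption made for this example.

```python
import re

# Python default: "." does not match a newline.
print(re.search(r"a.b", "a\nb") is None)          # True
# With (?s), "." matches a newline as well.
print(re.search(r"(?s)a.b", "a\nb") is not None)  # True

# Only the "bar" part ignores case; "foo" and "baz" stay case-sensitive.
scoped = re.compile(r"foo(?i:bar)baz")
print(scoped.fullmatch("fooBARbaz") is not None)  # True
print(scoped.fullmatch("FOObarbaz") is not None)  # False
```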
== Quoting == Quoting
@ -169,20 +173,20 @@ them:
a literal. That quoted sequence will continue until either the end of a literal. That quoted sequence will continue until either the end of
the regex, or the appearance of `\E`. the regex, or the appearance of `\E`.
For example `.\Q.^$\E$` will match any character followed by the literal For example, `.\Q.^$\E$` will match any character followed by the
string `.^$` followed by an end of line. literal string `.^$`, followed by an end of line.
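Python's `re` module has no `\Q...\E`, but `re.escape` gives a similar effect and can be used to sketch what the example pattern matches. The construction below is an approximation under that assumption, not Kakoune's implementation.

```python
import re

# re.escape backslash-escapes the syntax characters, so ".^$" becomes literal.
quoted = re.escape(".^$")
# Any character, then the literal string ".^$", then an end of line.
pattern = re.compile("." + quoted + "$")

print(pattern.search("x.^$") is not None)   # True
print(pattern.search("x.^$y") is not None)  # False
print(pattern.search("xabc") is not None)   # False
```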
== Compatibility == Compatibility
The syntax tries to follow the ECMAScript regex syntax as defined by Kakoune's syntax tries to follow the ECMAScript regex syntax, as defined
https://www.ecma-international.org/ecma-262/8.0/ some divergences by <https://www.ecma-international.org/ecma-262/8.0/>; some divergence
exists for ease of use or performance reasons: exists for ease of use, or performance reasons:
* lookarounds are not arbitrary, but lookbehind is supported. * Lookarounds are not arbitrary, but lookbehind is supported.
* `\K`, `\Q..\E`, `\A`, `\h` and `\z` are added. * `\K`, `\Q..\E`, `\A`, `\h` and `\z` are added.
* Stricter handling of escaping, as we introduce additional * Stricter handling of escaping, as we introduce additional escapes;
escapes, identity escapes like `\X` with X a non-special character identity escapes like `\X` with `X` being a non-special character
are not accepted, to avoid confusions between `\h` meaning literal are not accepted, to avoid confusions between `\h` meaning literal
`h` in ECMAScript, and horizontal blank in Kakoune. `h` in ECMAScript, and horizontal blank in Kakoune.
* `\uXXXXXX` uses 6 digits to cover all of unicode, instead of relying * `\uXXXXXX` uses 6 digits to cover all of Unicode, instead of relying
on ECMAScript UTF-16 surrogate pairs with 4 digits. on ECMAScript UTF-16 surrogate pairs with 4 digits.
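The last point can be illustrated with a little arithmetic in Python. The sketch shows that a codepoint outside the Basic Multilingual Plane fits in six hexadecimal digits, while UTF-16 needs two 4-digit code units (a surrogate pair) for it. The helper `utf16_units` and the choice of U+1F600 are only for illustration.

```python
def utf16_units(codepoint):
    """Return the UTF-16 code units for codepoint as 4-digit hex strings."""
    encoded = chr(codepoint).encode("utf-16-be")
    return [encoded[i:i + 2].hex().upper() for i in range(0, len(encoded), 2)]


grinning_face = 0x1F600

print(f"{grinning_face:06X}")      # 01F600
print(utf16_units(grinning_face))  # ['D83D', 'DE00']
# The highest Unicode codepoint still fits in six hex digits.
print(f"{0x10FFFF:06X}")           # 10FFFF
```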