= Regex

== Regex Syntax

Kakoune regex syntax is based on the ECMAScript syntax, as defined by the
ECMA-262 standard (see <<Compatibility>>).

Kakoune regexes always run on Unicode codepoint sequences, not on bytes.
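
The difference between matching codepoints and matching bytes can be
sketched with Python's `re` module. Python is only a stand-in here (an
assumption made for illustration); Kakoune's engine is a separate
implementation. A text pattern sees `é` as one codepoint, whereas a bytes
pattern sees the two bytes of its UTF-8 encoding:

```python
import re

text = "\u00e9"                  # 'é', a single codepoint
encoded = text.encode("utf-8")   # two bytes: 0xC3 0xA9

# Codepoint-based matching: `.` consumes the whole character.
codepoint_match = re.fullmatch(r".", text)

# Byte-based matching: `.` consumes one byte, so one `.` is not enough.
byte_match = re.fullmatch(rb".", encoded)

print(len(text), len(encoded))       # 1 2
print(codepoint_match is not None)   # True
print(byte_match is not None)        # False
```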

== Literals

Every character except the syntax characters `\^$.*+?[]{}|()` matches
itself. Syntax characters can be escaped with a backslash, so `\$`
matches a literal `$` and `\\` matches a literal `\`.
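
As a minimal sketch of this escaping rule, the hypothetical helper
`escape_literal` below prefixes each syntax character with a backslash.
The helper and the `SYNTAX_CHARS` constant are illustrative names, and the
result is checked with Python's `re` module, which accepts the same
backslash escapes for these characters; Kakoune's own engine is not
involved.

```python
import re

# The syntax characters that need escaping (illustrative constant).
SYNTAX_CHARS = r"\^$.*+?[]{}|()"

def escape_literal(text: str) -> str:
    """Prefix every syntax character in `text` with a backslash."""
    return "".join("\\" + ch if ch in SYNTAX_CHARS else ch for ch in text)

pattern = escape_literal("cost: $5.00 (approx)")
print(pattern)  # cost: \$5\.00 \(approx\)

# The escaped pattern matches the original text literally.
print(re.fullmatch(pattern, "cost: $5.00 (approx)") is not None)  # True
```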

Some literals are available as escape sequences:

* `\f` matches the form feed character.
* `\n` matches the line feed character.
* `\r` matches the carriage return character.
* `\t` matches the tabulation character.
* `\v` matches the vertical tabulation character.
* `\0` matches the null character.
* `\cX` matches the control-X character (X can be in `[A-Za-z]`).
* `\xXX` matches the character whose codepoint is XX (in hexadecimal).
* `\uXXXXXX` matches the character whose codepoint is XXXXXX (in hexadecimal).

== Character classes

The `[` character introduces a character class, matching one character
from a set of characters.

A character class contains a list of literals, character ranges,
and character class escapes surrounded by `[` and `]`.

If the first character inside a character class is `^`, then the character
class is negated, meaning that it matches every character not specified
in the character class.

Literals match themselves, including syntax characters, so `^`
does not need to be escaped in a character class. `[*+]` matches both
the `*` character and the `+` character. Literal escape sequences are
supported, so `[\n\r]` matches both the line feed and carriage return
characters.

The `]` character needs to be escaped for it to match a literal `]`
instead of closing the character class.

Character ranges are written as `<start character>-<end character>`, so
`[A-Z]` matches all upper case basic letters. `[A-Z0-9]` matches all
upper case basic letters and all basic digits.

A `-` character in a character class that does not specify a range is
treated as a literal `-`, so `[A-Z-+]` matches all upper case basic
letters, the `-` character, and the `+` character.
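
The character class rules above can be tried out with a small sketch.
It uses Python's `re` module, which treats these particular classes the
same way; the helper `matching_chars` is a hypothetical name introduced
for the example, and Kakoune's engine may differ in other details.

```python
import re

def matching_chars(char_class: str, candidates: str) -> str:
    """Return the characters of `candidates` accepted by `char_class`."""
    return "".join(ch for ch in candidates if re.fullmatch(char_class, ch))

print(matching_chars(r"[*+]", "a*+^"))      # *+
print(matching_chars(r"[^*+]", "a*+^"))     # a^  (negated class)
print(matching_chars(r"[a^]", "a*+^"))      # a^  (`^` is literal when not first)
print(matching_chars(r"[\]x]", "x]y"))      # x]  (escaped `]`)
print(matching_chars(r"[A-Z0-9]", "aB3-"))  # B3
print(matching_chars(r"[A-Z-+]", "aB-+"))   # B-+ (`-` outside a range is literal)
```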

Supported character class escapes are:

* `\d` which matches all digits.
* `\w` which matches all word characters.
* `\s` which matches all whitespace characters.
* `\h` which matches all horizontal whitespace characters.

Using an upper case letter instead of a lower case one will negate
the character class, meaning for example that `\D` will match every
non-digit character.

Character class escapes can be used outside of a character class; `\d`
is equivalent to `[\d]`.
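
The following sketch illustrates class escapes and their negated forms
with Python's `re` module, used here as a stand-in. Python has no `\h`
escape, so horizontal whitespace is left out, and its notion of digit and
word characters is Python's own rather than Kakoune's.

```python
import re

sample = "a1 _\t"

digits     = re.findall(r"\d", sample)   # ['1']
non_digits = re.findall(r"\D", sample)   # ['a', ' ', '_', '\t']
word_chars = re.findall(r"\w", sample)   # ['a', '1', '_']
spaces     = re.findall(r"\s", sample)   # [' ', '\t']

# A class escape behaves the same inside or outside brackets.
same = re.findall(r"\d", sample) == re.findall(r"[\d]", sample)
print(digits, non_digits, word_chars, spaces, same)
```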

== Any character

`.` matches any character, including new lines.

== Groups

Regex atoms can be grouped using `(` and `)` or `(?:` and `)`. If `(` is
used, the group is a capturing group, which means the positions in
the subject string that matched between `(` and `)` are recorded.

Capture groups are numbered starting at 1. They are numbered in the
order of appearance of their `(` in the regex. A special capture group
0 is for the whole sequence that matched.

* `(?:` introduces a non-capturing group, which will not record the
matching positions.

* `(?<name>` introduces a named capturing group, which, in addition to
being referred to by number, can be, in certain contexts, referred to
by the given name.
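
Capture numbering can be illustrated with a short sketch in Python's `re`
module, which numbers groups by their opening parenthesis in the same way.
Named groups are left out because Python spells them `(?P<name>...)`
rather than `(?<name>...)`.

```python
import re

match = re.fullmatch(r"(a(b))(?:c)(d)", "abcd")

print(match.group(0))  # abcd  (whole match)
print(match.group(1))  # ab    (first `(`)
print(match.group(2))  # b     (second `(`)
print(match.group(3))  # d     (`(?:c)` did not take a number)
print(match.span(2))   # (1, 2), the recorded positions of group 2
```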

== Alternations

The `|` character introduces an alternation, which will either match
its left-hand side, or its right-hand side (preferring the left-hand side).

For example, `foo|bar` matches either `foo` or `bar`, and
`foo(bar|baz|qux)` matches `foo` followed by either `bar`, `baz` or `qux`.
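
The left-hand preference can be observed when both sides match at the
same position. The sketch below uses Python's `re` module, which orders
alternatives the same way; it is an illustration, not Kakoune's
implementation.

```python
import re

# Both branches could match at position 0; the left one wins.
left_short = re.match(r"foo|foobar", "foobar").group(0)   # 'foo'
left_long  = re.match(r"foobar|foo", "foobar").group(0)   # 'foobar'

# Alternation inside a group.
pattern = re.compile(r"foo(bar|baz|qux)")
results = [pattern.fullmatch(s) is not None for s in ("foobar", "fooqux", "foo")]

print(left_short, left_long, results)  # foo foobar [True, True, False]
```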

== Quantifier

Literals, character classes, any-character (`.`) and groups can be
followed by a quantifier, which specifies the number of times they can
match.

* `?` matches zero or one time.
* `*` matches zero or more times.
* `+` matches one or more times.
* `{n}` matches exactly n times.
* `{n,}` matches n or more times.
* `{n,m}` matches n to m times.
* `{,m}` matches zero to m times.

By default, quantifiers are *greedy*, which means they will prefer to
match more characters if possible. Suffixing a quantifier with `?` will
make it non-greedy, meaning it will prefer to match as few characters
as possible.
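
A brief sketch of greedy versus non-greedy matching and of counted
repetition, using Python's `re` module as a stand-in that accepts the same
quantifier forms:

```python
import re

text = "<a><b>"
greedy = re.match(r"<.+>", text).group(0)    # '<a><b>'
lazy   = re.match(r"<.+?>", text).group(0)   # '<a>'

# Counted repetition.
exact    = re.fullmatch(r"a{3}", "aaa") is not None      # True
at_least = re.fullmatch(r"a{2,}", "a") is not None       # False
between  = re.fullmatch(r"a{2,4}", "aaaaa") is not None  # False
up_to    = re.fullmatch(r"a{,2}", "") is not None        # True

print(greedy, lazy, exact, at_least, between, up_to)
```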

== Zero width assertions

Assertions do not consume any character, but will prevent the regex
from matching if they are not fulfilled.

* `^` matches at the start of a line, that is just after a new line
      character, or at the start of the subject (unless it is specified
      that the subject start is not a start of line).
* `$` matches at the end of a line, that is just before a new line
      character, or at the end of the subject (unless it is specified
      that the subject end is not an end of line).
* `\b` matches at a word boundary, when exactly one of the previous
       character and the current character is a word character.
* `\B` matches at a non-word boundary, when the previous character
       and the current character are both word characters, or both
       are not.
* `\A` matches at the start of the subject string.
* `\z` matches at the end of the subject string.
* `\K` always matches, and resets the start position of capture
       group 0 to the current position.
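
Some of these assertions can be sketched with Python's `re` module. This
is an approximation: Python needs the `re.MULTILINE` flag for `^` and `$`
to match at line boundaries, and it has no `\K`, so that assertion is not
shown.

```python
import re

text = "one two\nthree"

# With re.MULTILINE, ^ and $ match at line boundaries.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)   # ['one', 'three']
line_ends   = re.findall(r"\w+$", text, re.MULTILINE)   # ['two', 'three']

# \b requires a word/non-word transition; \B requires the absence of one.
words = "cat concat cats"
whole_word = [m.start() for m in re.finditer(r"\bcat\b", words)]  # [0]
inside     = [m.start() for m in re.finditer(r"\Bcat", words)]    # [7]

# \A matches only at the very start of the subject, not after a new line.
subject_start = re.findall(r"\A\w+", text, re.MULTILINE)  # ['one']

print(line_starts, line_ends, whole_word, inside, subject_start)
```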

More complex assertions can be expressed with lookarounds:

* `(?=...)` is a lookahead: it matches if its content matches the text
            following the current position.
* `(?!...)` is a negative lookahead: it matches if its content does
            not match the text following the current position.
* `(?<=...)` is a lookbehind: it matches if its content matches
             the text preceding the current position.
* `(?<!...)` is a negative lookbehind: it matches if its content does
             not match the text preceding the current position.

For performance reasons, lookaround contents must be a sequence of
literals, character classes, or any-character (`.`); quantifiers are not
supported.

For example, `(?<!bar)(?=foo).` will match any character which is not
preceded by `bar` and where `foo` matches from the current position
(which means the character has to be an `f`).
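
The example above can be checked on a sample subject. The sketch uses
Python's `re` module, which supports these fixed-length lookarounds; the
subject string is made up for the illustration.

```python
import re

pattern = re.compile(r"(?<!bar)(?=foo).")
text = "barfoo foo"

# The first `foo` is preceded by `bar`, so only the second one qualifies.
matches = [(m.start(), m.group(0)) for m in pattern.finditer(text)]
print(matches)  # [(7, 'f')]
```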

== Modifiers

Some modifiers can control the matching behavior of the atoms following
them:

* `(?i)` enables case-insensitive matching
* `(?I)` disables case-insensitive matching (default)
* `(?s)` enables dot-matches-newline (default)
* `(?S)` disables dot-matches-newline
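
As a rough illustration, Python's `re` module has `(?i)` and `(?s)` inline
flags with a similar meaning when placed at the start of a pattern. The
comparison is only approximate: Python has no `(?I)` or `(?S)`, and its
default for `.` corresponds to the behavior described for `(?S)` rather
than to Kakoune's default.

```python
import re

# (?i): case-insensitive matching.
insensitive = re.fullmatch(r"(?i)kakoune", "KaKoUnE") is not None  # True
sensitive   = re.fullmatch(r"kakoune", "KaKoUnE") is not None      # False

# (?s): `.` also matches a new line character.
dot_all     = re.fullmatch(r"(?s)a.b", "a\nb") is not None         # True
# Python's default: `.` stops at a new line character.
dot_no_line = re.fullmatch(r"a.b", "a\nb") is not None             # False

print(insensitive, sensitive, dot_all, dot_no_line)
```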

== Quoting

`\Q` will start a quoted sequence, where every character is treated as
a literal. That quoted sequence will continue until either the end of
the regex, or the appearance of `\E`.

For example, `.\Q.^$\E$` will match any character followed by the literal
string `.^$` followed by an end of line.
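
One way to picture quoting is as shorthand for escaping every character
in the quoted span. The sketch below rewrites `\Q...\E` spans into
backslash-escaped text so that Python's `re` module, which has no
`\Q...\E`, can run the result. The helper `expand_quoted` is a
hypothetical name, and it is deliberately simplified: it does not
recognize an escaped backslash directly in front of `Q`.

```python
import re

def expand_quoted(pattern: str) -> str:
    """Replace each \\Q...\\E span with its backslash-escaped equivalent."""
    out = []
    pos = 0
    while True:
        start = pattern.find(r"\Q", pos)
        if start == -1:                      # no more quoted spans
            out.append(pattern[pos:])
            break
        out.append(pattern[pos:start])
        end = pattern.find(r"\E", start + 2)
        if end == -1:                        # quote runs to the end of the regex
            out.append(re.escape(pattern[start + 2:]))
            break
        out.append(re.escape(pattern[start + 2:end]))
        pos = end + 2
    return "".join(out)

expanded = expand_quoted(r".\Q.^$\E$")
print(expanded)                                   # .\.\^\$$
print(re.fullmatch(expanded, "a.^$") is not None)  # True
```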

== Compatibility

The syntax tries to follow the ECMAScript regex syntax as defined by
https://www.ecma-international.org/ecma-262/8.0/ but some divergences
exist for ease of use or performance reasons:

* Lookarounds cannot contain arbitrary regexes, but lookbehind is supported.
* `\K`, `\Q..\E`, `\A`, `\h` and `\z` are added.
* Escaping is handled more strictly: because Kakoune introduces additional
  escapes, identity escapes like `\X` with X a non-special character
  are not accepted. This avoids confusion between `\h` meaning a literal
  `h` in ECMAScript and horizontal whitespace in Kakoune.
* `\uXXXXXX` uses 6 digits to cover all of Unicode, instead of relying
  on ECMAScript UTF-16 surrogate pairs with 4 digits.
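
To make the last point concrete, the sketch below computes the UTF-16
surrogate pair for a codepoint above U+FFFF and prints both spellings.
The function name `surrogate_pair` and the chosen codepoint are
illustrative; the arithmetic is the standard UTF-16 encoding rule.

```python
def surrogate_pair(codepoint: int) -> tuple:
    """Split a codepoint above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("codepoint is not in the supplementary planes")
    offset = codepoint - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

codepoint = 0x1F600  # example codepoint outside the Basic Multilingual Plane
high, low = surrogate_pair(codepoint)

six_digit = f"\\u{codepoint:06X}"          # single six-digit escape
four_digit = f"\\u{high:04X}\\u{low:04X}"  # pair of four-digit escapes

print(six_digit)   # \u01F600
print(four_digit)  # \uD83D\uDE00
```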