PCRE2PATTERN(3)		   Library Functions Manual	       PCRE2PATTERN(3)

NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 REGULAR EXPRESSION DETAILS

       The  syntax and semantics of the	regular	expressions that are supported
       by PCRE2	are described in detail	below. There is	a quick-reference syn-
       tax summary in the pcre2syntax page. PCRE2 tries	to match  Perl	syntax
       and  semantics as closely as it can.  PCRE2 also	supports some alterna-
       tive regular expression syntax that does	not  conflict  with  the  Perl
       syntax  in order	to provide some	compatibility with regular expressions
       in Python, .NET,	and Oniguruma. There are in addition some options that
       enable alternative syntax and semantics that are	not  the  same	as  in
       Perl.

       Perl's  regular expressions are described in its	own documentation, and
       regular expressions in general are covered in a number of  books,  some
       of which	have copious examples. Jeffrey Friedl's	"Mastering Regular Ex-
       pressions",  published by O'Reilly, covers regular expressions in great
       detail. This description	of PCRE2's regular expressions is intended  as
       reference material.

       This  document  discusses the regular expression	patterns that are sup-
       ported by PCRE2 when its	 main  matching	 function,  pcre2_match(),  is
       used.	PCRE2	 also	 has   an   alternative	  matching   function,
       pcre2_dfa_match(), which	matches	using a	different  algorithm  that  is
       not  Perl-compatible.  Some  of	the  features  discussed below are not
       available when DFA matching is used. The	advantages  and	 disadvantages
       of  the	alternative function, and how it differs from the normal func-
       tion, are discussed in the pcre2matching	page.

EBCDIC CHARACTER CODES

       Most computers use ASCII	or Unicode for encoding	characters, and	 PCRE2
       assumes this by default.	However, it can	be compiled to run in an envi-
       ronment that uses the EBCDIC code, which	is the case for	some IBM main-
       frame  operating	 systems. In the sections below, character code	values
       are ASCII or Unicode; in	an EBCDIC  environment	these  characters  may
       have  different	code values, and there are no code points greater than
       255. Differences	in behaviour when PCRE2	is running in an EBCDIC	 envi-
       ronment are described in	the section "EBCDIC environments" below, which
       you can ignore unless you really	are in an EBCDIC environment.

SPECIAL	START-OF-PATTERN ITEMS

       A  number  of options that can be passed	to pcre2_compile() can also be
       set by special items at the start of a pattern. These are not Perl-com-
       patible,	but are	provided to make these options accessible  to  pattern
       writers	who are	not able to change the program that processes the pat-
       tern. Any number	of these items may appear, but they must  all  be  to-
       gether  right  at the start of the pattern string, and the letters must
       be in upper case.

   UTF support

       In the 8-bit and	16-bit PCRE2 libraries,	characters may be coded	either
       as single code units, or	as multiple UTF-8 or UTF-16 code units.	UTF-32
       can be specified	for the	32-bit library,	in which  case	it  constrains
       the  character  values  to  valid  Unicode  code	points.	To process UTF
       strings,	PCRE2 must be built to include Unicode support (which  is  the
       default).  When	using  UTF  strings you	must either call the compiling
       function	with one or both of the	PCRE2_UTF  or  PCRE2_MATCH_INVALID_UTF
       options, or the pattern must start with the special sequence (*UTF),
       which is equivalent to setting the PCRE2_UTF option. How setting a UTF
       mode affects pattern matching is mentioned in several places below.
       There is	also a summary of features in the pcre2unicode page.

       Some applications that allow their users	to supply patterns may wish to
       restrict	  them	 to   non-UTF	data  for  security  reasons.  If  the
       PCRE2_NEVER_UTF option is passed	to pcre2_compile(), (*UTF) is not  al-
       lowed, and its appearance in a pattern causes an	error.

   Unicode property support

       Another	special	 sequence that may appear at the start of a pattern is
       (*UCP).	This has the same effect as setting the	PCRE2_UCP  option:  it
       causes  sequences such as \d and	\w to use Unicode properties to	deter-
       mine character types, instead of	recognizing only characters with codes
       less than 256 via a lookup table. It also causes upper/lower casing
       operations to use Unicode properties for characters with code points
       greater than 127, even when UTF is not set. These behaviours can be
       changed within the pattern; see the section entitled  "Internal	Option
       Setting"	below.

       Some applications that allow their users	to supply patterns may wish to
       restrict	 them  for  security reasons. If the PCRE2_NEVER_UCP option is
       passed to pcre2_compile(), (*UCP) is not	allowed, and its appearance in
       a pattern causes	an error.

   Locking out empty string matching

       Starting	a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same
       effect as passing the PCRE2_NOTEMPTY or	PCRE2_NOTEMPTY_ATSTART	option
       to whichever matching function is subsequently called to	match the pat-
       tern.  These options lock out the matching of empty strings, either en-
       tirely, or only at the start of the subject.

   Disabling auto-possessification

       If a pattern starts with	(*NO_AUTO_POSSESS), it has the same effect  as
       setting	the  PCRE2_NO_AUTO_POSSESS  option, or calling pcre2_set_opti-
       mize() with a PCRE2_AUTO_POSSESS_OFF directive. This stops  PCRE2  from
       making  quantifiers  possessive	when what follows cannot match the re-
       peated item. For	example, by default a+b	is treated as a++b.  For  more
       details,	see the	pcre2api documentation.

   Disabling start-up optimizations

       If  a  pattern  starts  with (*NO_START_OPT), it	has the	same effect as
       setting the PCRE2_NO_START_OPTIMIZE option, or calling  pcre2_set_opti-
       mize() with a PCRE2_START_OPTIMIZE_OFF directive. This disables several
       optimizations  for  quickly  reaching  "no match" results. For more de-
       tails, see the pcre2api documentation.

   Disabling automatic anchoring

       If a pattern starts with	(*NO_DOTSTAR_ANCHOR), it has the  same	effect
       as setting the PCRE2_NO_DOTSTAR_ANCHOR option, or calling pcre2_set_op-
       timize()	with a PCRE2_DOTSTAR_ANCHOR_OFF	directive. This	disables opti-
       mizations  that	apply  to  patterns whose top-level branches all start
       with .* (match any number of arbitrary characters). For	more  details,
       see the pcre2api	documentation.

   Disabling JIT compilation

       If  a  pattern  that starts with	(*NO_JIT) is successfully compiled, an
       attempt by the application to apply the	JIT  optimization  by  calling
       pcre2_jit_compile() is ignored.

   Setting match resource limits

       The pcre2_match() function contains a counter that is incremented every
       time it goes round its main loop. The caller of pcre2_match() can set a
       limit  on  this counter,	which therefore	limits the amount of computing
       resource	used for a match. The maximum depth of nested backtracking can
       also be limited;	this indirectly	restricts the amount  of  heap	memory
       that  is	 used,	but there is also an explicit memory limit that	can be
       set.

       These facilities	are provided to	catch runaway matches  that  are  pro-
       voked  by patterns with huge matching trees. A common example is	a pat-
       tern with nested	unlimited repeats applied to a long string  that  does
       not  match. When	one of these limits is reached,	pcre2_match() gives an
       error return. The limits	can also be set	by items at the	start  of  the
       pattern of the form

	 (*LIMIT_HEAP=d)
	 (*LIMIT_MATCH=d)
	 (*LIMIT_DEPTH=d)

       where d is any number of	decimal	digits.	However, the value of the set-
       ting  must  be  less than the value set (or defaulted) by the caller of
       pcre2_match() for it to have any	effect.	In other  words,  the  pattern
       writer  can lower the limits set	by the programmer, but not raise them.
       If there	is more	than one setting of one	of  these  limits,  the	 lower
       value  is used. The heap	limit is specified in kibibytes	(units of 1024
       bytes).
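
       The rule that a pattern can lower a limit but never raise it can be
       sketched in C as follows. This is only an illustration of the rule
       described above: effective_limit() and heap_limit_bytes() are
       hypothetical helpers, not PCRE2 functions, and the sketch assumes that
       the values of the (*LIMIT_...=d) items have already been extracted
       from the pattern.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE2 API: given the limit set (or
   defaulted) by the caller and the values taken from (*LIMIT_...=d) items
   at the start of a pattern, return the limit that is in force. */
uint32_t effective_limit(uint32_t caller_limit,
                         const uint32_t *pattern_values, size_t count)
{
    uint32_t limit = caller_limit;
    for (size_t i = 0; i < count; i++) {
        /* A pattern setting has an effect only when it is lower; with
           several settings the lowest one wins. */
        if (pattern_values[i] < limit)
            limit = pattern_values[i];
    }
    return limit;
}

/* The heap limit is expressed in kibibytes (units of 1024 bytes). */
uint64_t heap_limit_bytes(uint32_t kibibytes)
{
    return (uint64_t)kibibytes * 1024u;
}
```

       With a caller limit of 10000, pattern settings of 5000 and 2000 give
       an effective limit of 2000, whereas a single setting of 20000 leaves
       the limit at 10000.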

       Prior to	release	10.30, LIMIT_DEPTH was	called	LIMIT_RECURSION.  This
       name is still recognized	for backwards compatibility.

       The heap	limit applies only when	the pcre2_match() or pcre2_dfa_match()
       interpreters are	used for matching. It does not apply to	JIT. The match
       limit  is used (but in a	different way) when JIT	is being used, or when
       pcre2_dfa_match() is called, to limit computing resource	usage by those
       matching	functions. The depth limit is ignored by JIT but  is  relevant
       for  DFA	 matching, which uses function recursion for recursions	within
       the pattern and for lookaround assertions and atomic  groups.  In  this
       case, the depth limit controls the depth	of such	recursion.

   Newline conventions

       PCRE2  supports six different conventions for indicating	line breaks in
       strings:	a single CR (carriage return) character, a  single  LF	(line-
       feed) character,	the two-character sequence CRLF, any of	the three pre-
       ceding,	any  Unicode  newline  sequence,  or the NUL character (binary
       zero). The pcre2api page	has further  discussion	 about	newlines,  and
       shows how to set	the newline convention when calling pcre2_compile().

       It  is also possible to specify a newline convention by starting	a pat-
       tern string with	one of the following sequences:

	 (*CR)	      carriage return
	 (*LF)	      linefeed
	 (*CRLF)      carriage return, followed	by linefeed
	 (*ANYCRLF)   any of the three above
	 (*ANY)	      all Unicode newline sequences
	 (*NUL)	      the NUL character	(binary	zero)

       These override the default and the options given	to the compiling func-
       tion. For example, on a Unix system where LF is the default newline se-
       quence, the pattern

	 (*CR)a.b

       changes the convention to CR. That pattern matches "a\nb" because LF is
       no longer a newline. If more than one of	these settings is present, the
       last one	is used.
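
       As a rough illustration of why LF stops being a newline under (*CR),
       the following C sketch recognizes a newline at a given position for
       five of the six conventions. The enum and function names are invented
       for this example and are not PCRE2 identifiers; the (*ANY) convention
       is left out because it needs the full list of Unicode newline
       sequences.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative names only; these are not PCRE2 identifiers. */
enum newline_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_NUL };

/* Return the length of the newline that starts at s[i] under the given
   convention, or 0 if no newline starts there. */
size_t newline_length(enum newline_convention nl,
                      const char *s, size_t len, size_t i)
{
    if (i >= len)
        return 0;

    bool crlf = s[i] == '\r' && i + 1 < len && s[i + 1] == '\n';

    switch (nl) {
    case NL_CR:
        return s[i] == '\r' ? 1 : 0;
    case NL_LF:
        return s[i] == '\n' ? 1 : 0;
    case NL_CRLF:
        return crlf ? 2 : 0;
    case NL_ANYCRLF:
        if (crlf)
            return 2;
        return (s[i] == '\r' || s[i] == '\n') ? 1 : 0;
    case NL_NUL:
        return s[i] == '\0' ? 1 : 0;
    }
    return 0;
}
```

       For the subject "a\nb", newline_length() reports no newline at offset
       1 under NL_CR, which is why the dot in (*CR)a.b is free to match the
       LF character.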

       The newline convention affects where the	circumflex and	dollar	asser-
       tions are true. It also affects the interpretation of the dot metachar-
       acter  when  PCRE2_DOTALL  is not set, and the behaviour	of \N when not
       followed	by an opening brace. However, it does not affect what  the  \R
       escape  sequence	 matches.  By default, this is any Unicode newline se-
       quence, for Perl	compatibility. However,	this can be changed;  see  the
       next section and	the description	of \R in the section entitled "Newline
       sequences"  below. A change of \R setting can be	combined with a	change
       of newline convention.

   Specifying what \R matches

       It is possible to restrict \R to	match only CR, LF, or CRLF (instead of
       the complete set	 of  Unicode  line  endings)  by  setting  the	option
       PCRE2_BSR_ANYCRLF  at compile time. This	effect can also	be achieved by
       starting	a pattern with (*BSR_ANYCRLF).	For  completeness,  (*BSR_UNI-
       CODE) is	also recognized, corresponding to PCRE2_BSR_UNICODE.

CHARACTERS AND METACHARACTERS

       A  regular  expression  is  a pattern that is matched against a subject
       string from left	to right. Most characters stand	for  themselves	 in  a
       pattern,	 and  match  the corresponding characters in the subject. As a
       trivial example,	the pattern

	 The quick brown fox

       matches a portion of a subject string that is identical to itself. When
       caseless	matching is  specified	(the  PCRE2_CASELESS  option  or  (?i)
       within  the  pattern),  letters are matched independently of case. Note
       that there are two ASCII	characters, K and  S,  that,  in  addition  to
       their  lower  case  ASCII equivalents, are case-equivalent with Unicode
       U+212A (Kelvin sign) and	 U+017F	 (long	S)  respectively  when	either
       PCRE2_UTF or PCRE2_UCP is set, unless the PCRE2_EXTRA_CASELESS_RESTRICT
       option  is in force (either passed to pcre2_compile() or	set by (*CASE-
       LESS_RESTRICT) or (?r) within the pattern).  If	the  PCRE2_EXTRA_TURK-
       ISH_CASING  option is in	force (either passed to	pcre2_compile()	or set
       by (*TURKISH_CASING) within the pattern),  then	the  'i'  letters  are
       matched according to Turkish and	Azeri languages.

       The power of regular expressions	comes from the ability to include wild
       cards, character	classes, alternatives, and repetitions in the pattern.
       These are encoded in the	pattern	by the use of metacharacters, which do
       not  stand  for	themselves but instead are interpreted in some special
       way.

       There are two different sets of metacharacters: those that  are	recog-
       nized  anywhere in the pattern except within square brackets, and those
       that are	recognized within square brackets.  Outside  square  brackets,
       the metacharacters are as follows:

	 \	general	escape character with several uses
	 ^	assert start of	string (or line, in multiline mode)
	 $	assert end of string (or line, in multiline mode)
	 .	match any character except newline (by default)
	 [	start character	class definition
	 |	start of alternative branch
	 (	start group or control verb
	 )	end group or control verb
	 *	0 or more quantifier
	 +	1 or more quantifier; also "possessive quantifier"
	 ?	0 or 1 quantifier; also	quantifier minimizer
	 {	potential start	of min/max quantifier

       Brace  characters  {  and } are also used to enclose data for construc-
       tions such as \g{2} or \k{name}.	In almost all uses  of	braces,	 space
       and/or horizontal tab characters	that follow { or precede } are allowed
       and  are	 ignored. In the case of quantifiers, they may also appear be-
       fore or after the comma.	The exception to this is \u{...} which	is  an
       ECMAScript  compatibility  feature  that	 is  recognized	 only when the
       PCRE2_EXTRA_ALT_BSUX option is set. ECMAScript  does  not  ignore  such
       white space; it causes the item to be interpreted as literal.

       Part  of	 a  pattern  that is in	square brackets	is called a "character
       class". In a character class the	only metacharacters are:

	 \	general	escape character
	 ^	negate the class, but only if the first	character
	 -	indicates character range
	 [	POSIX character	class (if followed by POSIX syntax)
	 ]	terminates the character class

       If a pattern is compiled	with the  PCRE2_EXTENDED  option,  most	 white
       space in	the pattern, other than	in a character class, within a \Q...\E
       sequence,  or  between  a # outside a character class and the next new-
       line, inclusive,	is ignored. An escaping	backslash can be used  to  in-
       clude  a	 white	space  or a # character	as part	of the pattern.	If the
       PCRE2_EXTENDED_MORE option is set, the same applies,  but  in  addition
       unescaped  space	 and  horizontal  tab  characters are ignored inside a
       character class.	Note: only these two characters	are ignored,  not  the
       full  set  of pattern white space characters that are ignored outside a
       character class.	Option settings	can be changed within a	 pattern;  see
       the section entitled "Internal Option Setting" below.

       The following sections describe the use of each of the metacharacters.

BACKSLASH

       The backslash character has several uses. Firstly, if it	is followed by
       a  character that is not	a digit	or a letter, it	takes away any special
       meaning that character may have.	This use of  backslash	as  an	escape
       character applies both inside and outside character classes.

       For  example,  if you want to match a * character, you must write \* in
       the pattern. This escaping action applies whether or not	the  following
       character  would	 otherwise be interpreted as a metacharacter, so it is
       always safe to precede a	non-alphanumeric  with	backslash  to  specify
       that it stands for itself.  In particular, if you want to match a back-
       slash, you write	\\.

       Only  ASCII  digits  and	letters	have any special meaning after a back-
       slash. All other	characters (in particular, those whose code points are
       greater than 127) are treated as	literals.
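
       Because a backslash before any non-alphanumeric ASCII character is
       always safe, an application can turn an arbitrary string into a
       pattern that matches it literally. The following C sketch shows one
       way to do that. escape_literal() is a hypothetical helper, not a PCRE2
       function; it assumes an ASCII-based, NUL-terminated input and copies
       bytes above 127 unchanged, since those are treated as literals anyway.

```c
#include <stddef.h>

/* Hypothetical helper: copy src to dst, putting a backslash before every
   ASCII character that is not a letter or a digit. Bytes above 127 are
   copied unchanged. Returns the length written (excluding the terminating
   NUL), or (size_t)-1 if dst is too small. */
size_t escape_literal(const char *src, char *dst, size_t dst_size)
{
    size_t out = 0;

    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        int alnum = (c >= '0' && c <= '9') ||
                    (c >= 'A' && c <= 'Z') ||
                    (c >= 'a' && c <= 'z');
        size_t need = (alnum || c > 127) ? 1 : 2;

        /* Always keep one byte in reserve for the terminator. */
        if (out + need + 1 > dst_size)
            return (size_t)-1;
        if (need == 2)
            dst[out++] = '\\';
        dst[out++] = (char)c;
    }

    if (out + 1 > dst_size)
        return (size_t)-1;
    dst[out] = '\0';
    return out;
}
```

       For example, the four-character input a*b\ becomes the six-character
       pattern a\*b\\.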

       If you want to treat all	characters in a	sequence as literals, you  can
       do  so by putting them between \Q and \E. Note that this	includes white
       space even when the PCRE2_EXTENDED option is set	 so  that  most	 other
       white  space is ignored.	The behaviour is different from	Perl in	that $
       and @ are handled as literals in	\Q...\E	sequences in PCRE2, whereas in
       Perl, $ and @ cause variable interpolation. Also,  Perl	does  "double-
       quotish	backslash  interpolation" on any backslashes between \Q	and \E
       which, its documentation	says, "may lead	to confusing  results".	 PCRE2
       treats  a  backslash  between  \Q and \E	just like any other character.
       Note the	following examples:

	 Pattern	    PCRE2 matches   Perl matches

	 \Qabc$xyz\E	    abc$xyz	   abc followed	by the
					     contents of $xyz
	 \Qabc\$xyz\E	    abc\$xyz	   abc\$xyz
	 \Qabc\E\$\Qxyz\E   abc$xyz	   abc$xyz
	 \QA\B\E	    A\B		   A\B
	 \Q\\E		    \		   \\E

       The \Q...\E sequence is recognized both inside  and  outside  character
       classes.	  An  isolated \E that is not preceded by \Q is	ignored. If \Q
       is not followed by \E later in the pattern, the literal	interpretation
       continues  to  the  end	of  the	pattern	(that is, \E is	assumed	at the
       end). If	the isolated \Q	is inside a character class,  this  causes  an
       error,  because the character class is then not terminated by a closing
       square bracket.

       Another difference from Perl is that any	appearance of \Q or \E	inside
       what  might otherwise be	a quantifier causes PCRE2 not to recognize the
       sequence	as a quantifier. Perl recognizes a quantifier if (redundantly)
       either of the numbers is	inside \Q...\E,	 but  not  if  the  separating
       comma  is.  When	 not  recognized  as  a	 quantifier a sequence such as
       {\Q1\E,2} is treated as the literal string "{1,2}".
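
       An alternative way to make a string literal is to wrap it in \Q...\E.
       The only sequence that needs care is a literal \E inside the string,
       which would end the quoted run early. The C sketch below handles it by
       closing the run, emitting an escaped backslash followed by E, and
       reopening the run. quote_literal() and append() are hypothetical
       helpers written for this example, not PCRE2 functions.

```c
#include <stddef.h>
#include <string.h>

/* Append n bytes of text to dst, keeping it NUL-terminated.
   Returns 1 on success, 0 if dst is too small. */
static int append(char *dst, size_t dst_size, size_t *out,
                  const char *text, size_t n)
{
    if (*out + n + 1 > dst_size)
        return 0;
    memcpy(dst + *out, text, n);
    *out += n;
    dst[*out] = '\0';
    return 1;
}

/* Hypothetical helper: wrap src in \Q...\E. Each literal "\E" in src is
   emitted as \E\\E\Q (end the run, match backslash and E, restart the run).
   Returns the length written, or (size_t)-1 if dst is too small. */
size_t quote_literal(const char *src, char *dst, size_t dst_size)
{
    size_t out = 0;

    if (!append(dst, dst_size, &out, "\\Q", 2))
        return (size_t)-1;

    while (*src != '\0') {
        if (src[0] == '\\' && src[1] == 'E') {
            if (!append(dst, dst_size, &out, "\\E\\\\E\\Q", 7))
                return (size_t)-1;
            src += 2;
        } else {
            if (!append(dst, dst_size, &out, src, 1))
                return (size_t)-1;
            src++;
        }
    }

    if (!append(dst, dst_size, &out, "\\E", 2))
        return (size_t)-1;
    return out;
}
```

       The input abc$xyz yields \Qabc$xyz\E, and the input A\EB yields
       \QA\E\\E\QB\E.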

   Non-printing	characters

       A second	use of backslash provides a way	of encoding non-printing char-
       acters in patterns in a visible manner. There is	no restriction on  the
       appearance  of non-printing characters in a pattern, but	when a pattern
       is being	prepared by text editing, it is	often easier to	use one	of the
       following escape	sequences instead of the binary	 character  it	repre-
       sents.  In  an  ASCII or	Unicode	environment, these escapes are as fol-
       lows:

	 \a	     alarm, that is, the BEL character (hex 07)
	 \cx	     "control-x", where	x is a non-control ASCII character
	 \e	     escape (hex 1B)
	 \f	     form feed (hex 0C)
	 \n	     linefeed (hex 0A)
	 \r	     carriage return (hex 0D) (but see below)
	 \t	     tab (hex 09)
	 \0dd	     character with octal code 0dd
	 \ddd	     character with octal code ddd, or back reference
	 \o{ddd..}   character with octal code ddd..
	 \xhh	     character with hex	code hh
	 \x{hhh..}   character with hex	code hhh..
	 \N{U+hhh..} character with Unicode hex	code point hhh..

       A description of	how back references work is given later, following the
       discussion of parenthesized groups.

       By default, after \x that is not	followed by {, one or two  hexadecimal
       digits are read (letters	can be in upper	or lower case).	If the charac-
       ter  that follows \x is neither { nor a hexadecimal digit, an error oc-
       curs. This is different from Perl's default behaviour, which  generates
       a  NUL  character, but is in line with the behaviour of Perl's 'strict'
       mode in re.

       Any number of hexadecimal digits	may appear between \x{	and  }.	 If  a
       character  other	than a hexadecimal digit appears between \x{ and }, or
       if there	is no terminating }, an	error occurs.

       Characters whose	code points are	less than 256 can be defined by	either
       of the two syntaxes for \x or by	an octal sequence. There is no differ-
       ence in the way they are	handled. For example, \xdc is exactly the same
       as \x{dc} or \334.  However, using the braced versions does  make  such
       sequences easier	to read.

       Support	is  available  for some	ECMAScript (aka	JavaScript) escape se-
       quences via two compile-time options. If	PCRE2_ALT_BSUX is set, the se-
       quence \x followed by { is not recognized. Only if \x  is  followed  by
       two hexadecimal digits is it recognized as a character escape.
       Otherwise it is interpreted as a literal "x" character. In this mode,
       support for code points greater than 255 is provided by \u, which must
       be followed by four hexadecimal digits; otherwise it is interpreted as
       a literal "u" character.

       PCRE2_EXTRA_ALT_BSUX  has the same effect as PCRE2_ALT_BSUX and,	in ad-
       dition, \u{hhh..} is recognized as the character	specified by hexadeci-
       mal code	point.	There may be any number	of hexadecimal digits, but un-
       like other places that also use curly brackets, spaces are not  allowed
       and  would  result  in  the string being	interpreted as a literal. This
       syntax is from ECMAScript 6.

       The \N{U+hhh..} escape sequence is recognized only when PCRE2 is	 oper-
       ating  in  UTF  mode.  Perl also	uses \N{name} to specify characters by
       Unicode name; PCRE2 does	not support this. Note that  when  \N  is  not
       followed	by an opening brace (curly bracket) it has an entirely differ-
       ent meaning, matching any character that	is not a newline.

       There  are some legacy applications where the escape sequence \r	is ex-
       pected to match a newline. If the  PCRE2_EXTRA_ESCAPED_CR_IS_LF	option
       is  set,	 \r  in	 a  pattern is converted to \n so that it matches a LF
       (linefeed) instead of a CR (carriage return) character.

       The precise effect of \cx is as follows: if x is a lower case letter,
       it is converted to upper case. Then bit 6 of the character (hex 40) is
       inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A),
       but \c{ becomes hex 3B ({ is 7B), and \c; becomes hex 7B (; is 3B). If
       the code unit following \c has a code point less than 32 or greater
       than 126, a compile-time error occurs.
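
       The \cx computation is small enough to write out directly. The
       following C sketch assumes an ASCII or Unicode environment (not
       EBCDIC); control_escape() is a hypothetical name used only for this
       illustration.

```c
/* Return the character code produced by \cx in an ASCII environment,
   or -1 if x is outside the permitted range 32 to 126. */
int control_escape(int x)
{
    if (x < 32 || x > 126)
        return -1;              /* PCRE2 reports a compile-time error */
    if (x >= 0x61 && x <= 0x7A)
        x -= 0x20;              /* lower case letter to upper case */
    return x ^ 0x40;            /* invert bit 6 (hex 40) */
}
```

       This reproduces the values given above: control_escape('A') is hex 01,
       control_escape('z') is hex 1A, control_escape('{') is hex 3B, and
       control_escape(';') is hex 7B.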

       For  differences	in the way some	escapes	behave in EBCDIC environments,
       see section "EBCDIC environments" below.

   Octal escapes and back references

       The escape \o must be followed by a sequence of octal digits,  enclosed
       in  braces.  An	error occurs if	this is	not the	case. This escape pro-
       vides a way of  specifying  character  code  points  as	octal  numbers
       greater	than 0777, and it also allows octal numbers and	backreferences
       to be unambiguously distinguished.

       If braces are not used, after \0	up to two  further  octal  digits  are
       read.   However,	 if the	PCRE2_EXTRA_NO_BS0 option is set, at least one
       more octal digit	must follow \0 (use \00	to generate a NUL  character).
       Make  sure  you supply two digits after the initial zero	if the pattern
       character that follows is itself	an octal digit.

       Inside a	character class, when a	backslash is  followed	by  any	 octal
       digit,  up to three octal digits	are read to generate a code point. Any
       subsequent digits stand for themselves. The sequences  \8  and  \9  are
       treated as the literal characters "8" and "9".

       Outside a character class, Perl's handling of a backslash followed by a
       digit  other  than  0 is	complicated by ambiguity, and Perl has changed
       over time, causing PCRE2	also to	change.	From PCRE2 release 10.45 there
       is an option called PCRE2_EXTRA_PYTHON_OCTAL that causes	PCRE2  to  use
       Python's	 unambiguous  rules. The next two subsections describe the two
       sets of rules.

       For greater clarity and unambiguity, it is best to avoid	following \ by
       a digit greater than zero. Instead, use \o{...} or \x{...}  to  specify
       numerical character code	points,	and \g{...} to specify backreferences.

   Perl	rules for non-class backslash 1-9

       All  the	digits that follow the backslash are read as a decimal number.
       If the number is	less than 10, begins with the digit  8	or  9,	or  if
       there are at least that many previous capture groups in the expression,
       the  entire  sequence  is  taken	 as a back reference. Otherwise, up to
       three octal digits are read to form a character code. For example:

	 \040	is another way of writing an ASCII space
	 \40	is the same, provided there are	fewer than 40
		   previous capture groups
	 \7	is always a backreference
	 \11	might be a backreference, or another way of
		   writing a tab
	 \011	is always a tab
	 \0113	is a tab followed by the character "3"
	 \113	might be a backreference, otherwise the
		   character with octal	code 113
	 \377	might be a backreference, otherwise
		   the value 255 (decimal)
	 \81	is always a backreference

       Note that octal values of 100 or	greater	that are specified using  this
       syntax  must  not be introduced by a leading zero, because no more than
       three octal digits are ever read.
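
       The decision procedure described in this subsection can be sketched in
       C as follows. This is a reading of the rules as stated here, not code
       taken from PCRE2; the type and function names are invented for the
       example, and the caller is assumed to pass a pointer to the first
       digit (1 to 9) after the backslash together with the number of capture
       groups opened so far.

```c
#include <stddef.h>

enum escape_kind { ESC_BACKREF, ESC_CHAR };

struct digit_escape {
    enum escape_kind kind;
    unsigned value;     /* group number, or character code */
    size_t consumed;    /* digits used after the backslash */
};

/* digits points at a digit 1-9 that directly follows a backslash outside a
   character class; groups is the number of previous capture groups. */
struct digit_escape perl_digit_escape(const char *digits, unsigned groups)
{
    struct digit_escape r = { ESC_BACKREF, 0, 0 };
    unsigned long number = 0;
    size_t n = 0;

    /* Read all following digits as one decimal number. Accumulation stops
       growing once the value is far beyond any group count, which avoids
       overflow on very long digit strings. */
    while (digits[n] >= '0' && digits[n] <= '9') {
        if (number < 100000000UL)
            number = number * 10 + (unsigned long)(digits[n] - '0');
        n++;
    }

    if (number < 10 || digits[0] >= '8' || number <= groups) {
        r.value = (unsigned)number;
        r.consumed = n;
        return r;               /* the whole sequence is a backreference */
    }

    /* Otherwise read up to three octal digits as a character code. */
    r.kind = ESC_CHAR;
    while (r.consumed < 3 &&
           digits[r.consumed] >= '0' && digits[r.consumed] <= '7') {
        r.value = r.value * 8 + (unsigned)(digits[r.consumed] - '0');
        r.consumed++;
    }
    return r;
}
```

       With no capture groups, "40" gives the character code 32 (a space) and
       "113" gives octal 113; "11" gives a backreference when there are at
       least 11 groups and a tab otherwise.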

   Python rules for non-class backslash 1-9

       If there	are at least three octal digits	after the  backslash,  exactly
       three  are read as an octal code	point number, but the value must be no
       greater than \377, even in modes	where higher  code  point  values  are
       supported.  Any	subsequent  digits  stand for themselves. If there are
       fewer than three	octal digits, the sequence is taken as a decimal  back
       reference.  Thus, for example, \12 is always a back reference, indepen-
       dent of how many	captures there are in the pattern. An error is	gener-
       ated for	a reference to a non-existent capturing	group.
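
       For comparison, the Python rules reduce to a single test on the first
       three characters. The C sketch below is an interpretation of the text
       above rather than PCRE2's own code: it assumes that "at least three
       octal digits" means the three characters directly after the backslash
       are all octal digits, and it does not parse the backreference number.
       The names are hypothetical.

```c
enum python_kind { PY_OCTAL, PY_BACKREF, PY_ERROR };

/* digits points at a digit 1-9 that directly follows a backslash outside a
   character class. On PY_OCTAL, *code receives the code point. */
enum python_kind python_digit_escape(const char *digits, unsigned *code)
{
    int octal = 0;

    /* Count leading octal digits, up to three. */
    while (octal < 3 && digits[octal] >= '0' && digits[octal] <= '7')
        octal++;

    if (octal < 3)
        return PY_BACKREF;      /* taken as a decimal back reference */

    unsigned v = (unsigned)(digits[0] - '0') * 64u +
                 (unsigned)(digits[1] - '0') * 8u +
                 (unsigned)(digits[2] - '0');

    if (v > 0377)
        return PY_ERROR;        /* value must be no greater than \377 */

    *code = v;
    return PY_OCTAL;
}
```

       Under this reading, "12" is always a backreference, "101" is the
       character with code 65, and "777" is an error because it exceeds \377.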

   Constraints on character values

       Characters  that	 are  specified	using octal or hexadecimal numbers are
       limited to certain values, as follows:

	 8-bit non-UTF mode    no greater than 0xff
	 16-bit	non-UTF	mode   no greater than 0xffff
	 32-bit	non-UTF	mode   no greater than 0xffffffff
	 All UTF modes	       no greater than 0x10ffff	and a valid code point

       Invalid Unicode code points are all those in the	range 0xd800 to	0xdfff
       (the so-called "surrogate" code points).	The check  for	these  can  be
       disabled	 by  the  caller  of  pcre2_compile()  by  setting  the	option
       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only  in
       UTF-8  and  UTF-32 modes, because these values are not representable in
       UTF-16.
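
       The table and the surrogate rule can be combined into one range check.
       The following C sketch is illustrative only: code_point_allowed() is a
       hypothetical function, and its allow_surrogate_escapes parameter
       stands in for the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option.

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true if value may be written as an octal or hexadecimal escape.
   code_unit_bits is 8, 16, or 32; utf says whether a UTF mode is set. */
bool code_point_allowed(int code_unit_bits, bool utf,
                        bool allow_surrogate_escapes, uint32_t value)
{
    if (utf) {
        bool surrogate = value >= 0xd800 && value <= 0xdfff;

        if (value > 0x10ffff)
            return false;
        if (!surrogate)
            return true;
        /* Surrogates can be allowed only in UTF-8 and UTF-32 modes. */
        return allow_surrogate_escapes && code_unit_bits != 16;
    }

    switch (code_unit_bits) {
    case 8:
        return value <= 0xff;
    case 16:
        return value <= 0xffff;
    case 32:
        return true;            /* every 32-bit value is <= 0xffffffff */
    }
    return false;               /* unknown code unit width */
}
```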

   Escape sequences in character classes

       All the sequences that define a single character	value can be used both
       inside and outside character classes. In	addition, inside  a  character
       class, \b is interpreted	as the backspace character (hex	08).

       When not	followed by an opening brace, \N is not	allowed	in a character
       class.	\B,  \R, and \X	are not	special	inside a character class. Like
       other unrecognized alphabetic escape sequences, they  cause  an	error.
       Outside a character class, these	sequences have different meanings.

   Unsupported escape sequences

       In  Perl,  the  sequences  \F, \l, \L, \u, and \U are recognized	by its
       string handler and used to modify the case of following characters.  By
       default,	 PCRE2	does  not  support these escape	sequences in patterns.
       However,	if either of the PCRE2_ALT_BSUX	 or  PCRE2_EXTRA_ALT_BSUX  op-
       tions  is set, \U matches a "U" character, and \u can be	used to	define
       a character by code point, as described above.

   Absolute and	relative backreferences

       The sequence \g followed	by a signed or unsigned	number,	optionally en-
       closed in braces, is an absolute	or  relative  backreference.  A	 named
       backreference  can  be  coded as	\g{name}. Backreferences are discussed
       later, following	the discussion of parenthesized	groups.

   Absolute and	relative subroutine calls

       For compatibility with Oniguruma, the non-Perl syntax \g	followed by  a
       name or a number	enclosed either	in angle brackets or single quotes, is
       an  alternative syntax for referencing a	capture	group as a subroutine.
       Details are discussed later.   Note  that  \g{...}  (Perl  syntax)  and
       \g<...> (Oniguruma syntax) are not synonymous. The former is a backref-
       erence; the latter is a subroutine call.

   Generic character types

       Another use of backslash	is for specifying generic character types:

	 \d	any decimal digit
	 \D	any character that is not a decimal digit
	 \h	any horizontal white space character
	 \H	any character that is not a horizontal white space character
	 \N	any character that is not a newline
	 \s	any white space	character
	 \S	any character that is not a white space	character
	 \v	any vertical white space character
	 \V	any character that is not a vertical white space character
	 \w	any "word" character
	 \W	any "non-word" character

       The  \N	escape	sequence has the same meaning as the "." metacharacter
       when PCRE2_DOTALL is not	set, but setting PCRE2_DOTALL does not	change
       the meaning of \N. Note that when \N is followed	by an opening brace it
       has a different meaning.	See the	section	entitled "Non-printing charac-
       ters"  above for	details. Perl also uses	\N{name} to specify characters
       by Unicode name;	PCRE2 does not support this.

       Each pair of lower and upper case escape	sequences partitions the  com-
       plete  set  of  characters  into	two disjoint sets. Any given character
       matches one, and	only one, of each pair.	The sequences can appear  both
       inside  and outside character classes. They each	match one character of
       the appropriate type. If	the current matching point is at  the  end  of
       the  subject string, all	of them	fail, because there is no character to
       match.

       The default \s characters are HT	(9), LF	(10), VT  (11),	 FF  (12),  CR
       (13),  and  space (32), which are defined as white space	in the "C" lo-
       cale. This list may vary	if locale-specific matching is	taking	place.
       For  example, in	some locales the "non-breaking space" character	(\xA0)
       is recognized as	white space, and in others the VT character is not.

       A "word"	character is an	underscore or any character that is  a	letter
       or  digit.   By	default,  the definition of letters and	digits is con-
       trolled by PCRE2's low-valued character tables, and may vary if locale-
       specific	matching is taking place (see "Locale support" in the pcre2api
       page). For example, in a	French locale such  as	"fr_FR"	 in  Unix-like
       systems,	 or "french" in	Windows, some character	codes greater than 127
       are used	for accented letters, and these	are then matched  by  \w.  The
       use of locales with Unicode is discouraged.

       By  default,  characters	 whose	code points are	greater	than 127 never
       match \d, \s, or	\w, and	always match \D, \S, and \W, although this may
       be different for	characters in the range	128-255	 when  locale-specific
       matching	 is  happening.	  These	escape sequences retain	their original
       meanings	from before Unicode support was	available,  mainly  for	 effi-
       ciency  reasons.	 If  the  PCRE2_UCP  option  is	 set, the behaviour is
       changed so that Unicode properties  are	used  to  determine  character
       types, as follows:

	 \d  any character that	matches	\p{Nd} (decimal	digit)
	 \s  any character that	matches	\p{Z} or \h or \v
	 \w  any character that	matches	\p{L}, \p{N}, \p{Mn}, or \p{Pc}

       The addition of \p{Mn} (non-spacing mark) and the replacement of	an ex-
       plicit  test  for underscore with a test	for \p{Pc} (connector punctua-
       tion) happened in PCRE2 release 10.43. This brings PCRE2	into line with
       Perl.
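
       The PCRE2_UCP definitions of \d and \w above can be modelled outside
       PCRE2 as a rough illustration. The following Python sketch uses the
       standard unicodedata module; the function names ucp_digit and ucp_word
       are hypothetical, and Python's Unicode tables may be a different
       version from the ones a given PCRE2 build uses, so treat it as an
       approximation of the rule rather than PCRE2's implementation.

```python
import unicodedata


def ucp_digit(ch: str) -> bool:
    # Model of \d under PCRE2_UCP: general category Nd (decimal digit).
    return unicodedata.category(ch) == "Nd"


def ucp_word(ch: str) -> bool:
    # Model of \w under PCRE2_UCP: \p{L}, \p{N}, \p{Mn}, or \p{Pc}.
    cat = unicodedata.category(ch)
    return cat[0] in "LN" or cat in ("Mn", "Pc")
```

       In this model U+00B2 (superscript two, category No) satisfies ucp_word
       but not ucp_digit, which mirrors the remark that \d matches only
       decimal digits whereas \w matches any Unicode digit.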

       The upper case escapes match the	inverse	sets of	characters. Note  that
       \d  matches  only decimal digits, whereas \w matches any	Unicode	digit,
       as well as other	character categories. Note also	that PCRE2_UCP affects
       \b and \B, because they are defined in terms of \w and \W. Matching
       these sequences is noticeably slower when PCRE2_UCP is set.

       The  effect  of	PCRE2_UCP  on any one of these escape sequences	can be
       negated by the  options	PCRE2_EXTRA_ASCII_BSD,	PCRE2_EXTRA_ASCII_BSS,
       and  PCRE2_EXTRA_ASCII_BSW,  respectively. These	options	can be set and
       reset within a pattern by means of an internal option setting (see  be-
       low).

       The  sequences  \h, \H, \v, and \V, in contrast to the other sequences,
       which match only	ASCII characters by default, always match  a  specific
       list  of	 code  points, whether or not PCRE2_UCP	is set.	The horizontal
       space characters	are:

	 U+0009	    Horizontal tab (HT)
	 U+0020	    Space
	 U+00A0	    Non-break space
	 U+1680	    Ogham space	mark
	 U+180E	    Mongolian vowel separator
	 U+2000	    En quad
	 U+2001	    Em quad
	 U+2002	    En space
	 U+2003	    Em space
	 U+2004	    Three-per-em space
	 U+2005	    Four-per-em	space
	 U+2006	    Six-per-em space
	 U+2007	    Figure space
	 U+2008	    Punctuation	space
	 U+2009	    Thin space
	 U+200A	    Hair space
	 U+202F	    Narrow no-break space
	 U+205F	    Medium mathematical	space
	 U+3000	    Ideographic	space

       The vertical space characters are:

	 U+000A	    Linefeed (LF)
	 U+000B	    Vertical tab (VT)
	 U+000C	    Form feed (FF)
	 U+000D	    Carriage return (CR)
	 U+0085	    Next line (NEL)
	 U+2028	    Line separator
	 U+2029	    Paragraph separator

       In 8-bit, non-UTF-8 mode, only the characters  with  code  points  less
       than 256	are relevant.
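
       As an illustration only, the two lists above can be written down as
       Python sets. The names HSPACE, VSPACE, is_hspace, and is_vspace are
       hypothetical and are not part of any PCRE2 API; the code points are
       copied from the lists in this section.

```python
# Code points matched by \h, copied from the list above.
HSPACE = frozenset(
    [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E]
    + list(range(0x2000, 0x200B))  # U+2000 (en quad) to U+200A (hair space)
    + [0x202F, 0x205F, 0x3000]
)

# Code points matched by \v, copied from the list above.
VSPACE = frozenset([0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029])


def is_hspace(ch: str) -> bool:
    # True if ch is in the fixed \h list.
    return ord(ch) in HSPACE


def is_vspace(ch: str) -> bool:
    # True if ch is in the fixed \v list.
    return ord(ch) in VSPACE


# In 8-bit, non-UTF-8 mode only code points below 256 are relevant.
HSPACE_8BIT = frozenset(cp for cp in HSPACE if cp < 256)
VSPACE_8BIT = frozenset(cp for cp in VSPACE if cp < 256)
```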

   Newline sequences

       Outside	a  character class, by default,	the escape sequence \R matches
       any Unicode newline sequence. In	8-bit non-UTF-8	mode \R	is  equivalent
       to the following:

	 (?>\r\n|\n|\x0b|\f|\r|\x85)

       This is an example of an	"atomic	group",	details	of which are given be-
       low.   This  particular group matches either the	two-character sequence
       CR followed by LF, or  one  of  the  single  characters	LF  (linefeed,
       U+000A),	 VT  (vertical	tab, U+000B), FF (form feed, U+000C), CR (car-
       riage return, U+000D), or NEL (next line, U+0085). Because this	is  an
       atomic  group,  the  two-character sequence is treated as a single unit
       that cannot be split.

       In other	modes, two additional characters whose code points are greater
       than 255	are added: LS (line separator, U+2028) and PS (paragraph sepa-
       rator, U+2029).	Unicode	support	is not needed for these	characters  to
       be recognized.

       It is possible to restrict \R to	match only CR, LF, or CRLF (instead of
       the  complete  set  of  Unicode	line  endings)	by  setting the	option
       PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation  for	"back-
       slash R".) This can be made the default when PCRE2 is built; if this is
       the  case,  the other behaviour can be requested	via the	PCRE2_BSR_UNI-
       CODE option. It is also possible	to specify these settings by  starting
       a pattern string	with one of the	following sequences:

	 (*BSR_ANYCRLF)	  CR, LF, or CRLF only
	 (*BSR_UNICODE)	  any Unicode newline sequence

       These override the default and the options given	to the compiling func-
       tion.  Note that	these special settings,	which are not Perl-compatible,
       are  recognized only at the very	start of a pattern, and	that they must
       be in upper case. If more than one of them is present, the last one  is
       used. They can be combined with a change	of newline convention; for ex-
       ample, a	pattern	can start with:

	 (*ANY)(*BSR_ANYCRLF)

       They  can also be combined with the (*UTF) or (*UCP) special sequences.
       Inside a	character class, \R is treated as an unrecognized  escape  se-
       quence, and causes an error.
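
       The behaviour of \R described in this section can be sketched as a
       small Python function. This is a model of the rule, not PCRE2 code:
       the function name match_bsr and its parameters anycrlf and
       eight_bit_non_utf are hypothetical stand-ins for the
       PCRE2_BSR_ANYCRLF setting and for 8-bit non-UTF-8 mode.

```python
def match_bsr(subject: str, pos: int, anycrlf: bool = False,
              eight_bit_non_utf: bool = False) -> int:
    # Return how many characters \R would consume at pos (0 means no match).
    # CRLF is tried first and, as in the atomic group, is never split.
    if subject.startswith("\r\n", pos):
        return 2
    if pos >= len(subject):
        return 0
    if anycrlf:
        singles = "\r\n"                    # CR or LF only
    else:
        singles = "\n\x0b\x0c\r\x85"        # LF, VT, FF, CR, NEL
        if not eight_bit_non_utf:
            singles += "\u2028\u2029"       # LS and PS in the other modes
    return 1 if subject[pos] in singles else 0
```

       For example, match_bsr("a\r\nb", 1) consumes both characters of the
       CRLF pair, whereas with anycrlf=True a lone NEL (U+0085) is not
       treated as a newline sequence.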

   Unicode character properties

       When  PCRE2  is	built  with Unicode support (the default), three addi-
       tional escape sequences that match characters with specific  properties
       are available. They can be used in any mode, though in 8-bit and	16-bit
       non-UTF	modes these sequences are of course limited to testing charac-
       ters whose code points are less than U+0100 or  U+10000,	 respectively.
       In  32-bit non-UTF mode,	code points greater than 0x10ffff (the Unicode
       limit) may be encountered. These	are all	treated	as being  in  the  Un-
       known script and	with an	unassigned type.

       Matching	 characters by Unicode property	is not fast, because PCRE2 has
       to do a multistage table	lookup in order	to find	 a  character's	 prop-
       erty. That is why the traditional escape	sequences such as \d and \w do
       not  use	 Unicode  properties  in PCRE2 by default, though you can make
       them do so by setting the PCRE2_UCP option or by	starting  the  pattern
       with (*UCP).

       The extra escape	sequences that provide property	support	are:

	 \p{xx}	  a character with the xx property
	 \P{xx}	  a character without the xx property
	 \X	  a Unicode extended grapheme cluster

       For  compatibility  with	Perl, negation can be specified	by including a
       circumflex between the opening brace and	 the  property.	 For  example,
       \p{^Lu} is the same as \P{Lu}.

       In  accordance with Unicode's "loose matching" rules, ASCII white space
       characters, hyphens, and	underscores are	ignored	in the properties rep-
       resented	by xx above. As	well as	the space character, ASCII white space
       can be tab, linefeed, vertical tab, formfeed, or	carriage return.

       Some properties are specified as	a name only; others as a  name	and  a
       value,  separated  by  a	 colon or an equals sign. The names and	values
       consist of ASCII	letters	and digits (with one Perl-specific  exception,
       see  below).  They  are not case	sensitive. Note, however, that the es-
       capes themselves, \p and	\P, are	case sensitive.	 There	are  abbrevia-
       tions for many names. The following examples are	all equivalent:

	 \p{bidiclass=al}
	 \p{BC=al}
	 \p{ Bidi_Class	: AL }
	 \p{ Bi-di class = Al }
	 \P{ ^ Bi-di class = Al	}
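
       The loose matching rule can be illustrated with a short Python sketch
       that reduces a property specification to a canonical form. The
       function name loosen is hypothetical. The sketch covers only what is
       stated above (ignoring ASCII white space, hyphens, underscores, and
       case, and accepting either separator); it does not expand
       abbreviations such as BC and does not handle the circumflex.

```python
# ASCII white space (space, tab, linefeed, vertical tab, formfeed,
# carriage return), hyphen, and underscore are all ignored.
IGNORED = " \t\n\x0b\x0c\r-_"


def loosen(prop: str) -> str:
    # Drop ignored characters, fold case, and treat ':' and '=' alike.
    kept = "".join(c for c in prop if c not in IGNORED)
    return kept.lower().replace(":", "=")
```

       Under this model "bidiclass=al", " Bidi_Class : AL ", and
       " Bi-di class = Al " all reduce to the same string.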

       There  is  support  for	Unicode	script names, Unicode general category
       properties, "Any", which	matches	 any  character	 (including  newline),
       Bidi_Class,  a  number  of binary (yes/no) properties, and some special
       PCRE2 properties	(described below).  Certain other Perl properties such
       as "InMusicalSymbols" are not supported by  PCRE2.  Note	 that  \P{Any}
       does not	match any characters, so always	causes a match failure.

   Script properties for \p and	\P

       There are three different syntax	forms for matching a script. Each Uni-
       code  character	has  a	basic  script and, optionally, a list of other
       scripts ("Script	Extensions") with which	it is commonly used. Using the
       Adlam script as an example, \p{sc:Adlam}	matches	characters whose basic
       script is Adlam,	whereas	\p{scx:Adlam} matches, in addition, characters
       that have Adlam in their	extensions list. The full names	 "script"  and
       "script	extensions"  for the property types are	recognized and,	as for
       all property specifications, an equals sign is an  alternative  to  the
       colon.  If a script name	is given without a property type, for example,
       \p{Adlam}, it is	treated	as \p{scx:Adlam}. Perl changed to this	inter-
       pretation at release 5.26 and PCRE2 changed at release 10.40.

       Unassigned characters (and in non-UTF 32-bit mode, characters with code
       points greater than 0x10FFFF) are assigned the "Unknown"	script.	Others
       that  are not part of an	identified script are lumped together as "Com-
       mon". The current list of recognized script names and their 4-character
       abbreviations can be obtained by	running	this command:

	 pcre2test -LS

   The general category	property for \p	and \P

       Each character has exactly one Unicode general category property, spec-
       ified by	a two-letter abbreviation. If only  one	 letter	 is  specified
       with  \p	 or  \P,  it includes all the general category properties that
       start with that letter. In this case, in	the absence of	negation,  the
       curly  brackets in the escape sequence are optional; these two examples
       have the	same effect:

	 \p{L}
	 \pL

       The following general category property codes are supported:

	 C     Other
	 Cc    Control
	 Cf    Format
	 Cn    Unassigned
	 Co    Private use
	 Cs    Surrogate

	 L     Letter
	 Lc    Cased letter
	 Ll    Lower case letter
	 Lm    Modifier	letter
	 Lo    Other letter
	 Lt    Title case letter
	 Lu    Upper case letter

	 M     Mark
	 Mc    Spacing mark
	 Me    Enclosing mark
	 Mn    Non-spacing mark

	 N     Number
	 Nd    Decimal number
	 Nl    Letter number
	 No    Other number

	 P     Punctuation
	 Pc    Connector punctuation
	 Pd    Dash punctuation
	 Pe    Close punctuation
	 Pf    Final punctuation
	 Pi    Initial punctuation
	 Po    Other punctuation
	 Ps    Open punctuation

	 S     Symbol
	 Sc    Currency	symbol
	 Sk    Modifier	symbol
	 Sm    Mathematical symbol
	 So    Other symbol

	 Z     Separator
	 Zl    Line separator
	 Zp    Paragraph separator
	 Zs    Space separator

       Perl originally used the	name L&	for the	Lc  property.  This  is	 still
       supported  by Perl, but discouraged. PCRE2 also still supports it. This
       property	matches	any character that has the Lu, Ll, or Lt property,  in
       other  words,  any  letter  that	 is  not  classified  as a modifier or
       "other".	From release 10.45 of PCRE2 the	properties Lu, Ll, and Lt  are
       all  treated  as	 Lc  when  case-independent  matching  is  set	by the
       PCRE2_CASELESS option or	(?i) within the	pattern. The other  properties
       are not affected	by caseless matching.
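
       As a rough model of how the one-letter and two-letter codes relate,
       the following Python sketch tests a character against a general
       category code using the standard unicodedata module. The function
       name has_gc is hypothetical, the code is assumed to be given in its
       canonical capitalization, and Python's Unicode tables may differ in
       version from those of a given PCRE2 build.

```python
import unicodedata

# Lc (also written L&) covers the three cased-letter categories.
CASED = ("Lu", "Ll", "Lt")


def has_gc(ch: str, code: str) -> bool:
    cat = unicodedata.category(ch)   # always a two-letter code
    if code in ("Lc", "L&"):
        return cat in CASED
    if len(code) == 1:
        # A single letter includes every category that starts with it.
        return cat.startswith(code)
    return cat == code
```

       Here U+02B0 (a modifier letter, Lm) satisfies has_gc(ch, "L") but not
       has_gc(ch, "Lc"), in line with the description of Lc above.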

       The  Cs	(Surrogate)  property  applies	only  to characters whose code
       points are in the range U+D800 to U+DFFF. These characters are no  dif-
       ferent  to any other character when PCRE2 is not	in UTF mode (using the
       16-bit or 32-bit	library).  However, they  are  not  valid  in  Unicode
       strings and so cannot be	tested by PCRE2	in UTF mode, unless UTF	valid-
       ity   checking	has   been   turned   off   (see   the	discussion  of
       PCRE2_NO_UTF_CHECK in the pcre2api page).

       The long	synonyms for  property	names  that  Perl  supports  (such  as
       \p{Letter})  are	 not supported by PCRE2, nor is	it permitted to	prefix
       any of these properties with "Is".

       No character that is in the Unicode table has the Cn (unassigned) prop-
       erty.  Instead, this property is	assumed	for any	code point that	is not
       in the Unicode table.

   Binary (yes/no) properties for \p and \P

       Unicode defines a number	of  binary  properties,	 that  is,  properties
       whose  only  values  are	 true or false.	You can	obtain a list of those
       that are	recognized by \p and \P, along with  their  abbreviations,  by
       running this command:

	 pcre2test -LP

   The Bidi_Class property for \p and \P

	 \p{Bidi_Class:<class>}	  matches a character with the given class
	 \p{BC:<class>}		  matches a character with the given class

       The recognized classes are:

	 AL	     Arabic letter
	 AN	     Arabic number
	 B	     paragraph separator
	 BN	     boundary neutral
	 CS	     common separator
	 EN	     European number
	 ES	     European separator
	 ET	     European terminator
	 FSI	     first strong isolate
	 L	     left-to-right
	 LRE	     left-to-right embedding
	 LRI	     left-to-right isolate
	 LRO	     left-to-right override
	 NSM	     non-spacing mark
	 ON	     other neutral
	 PDF	     pop directional format
	 PDI	     pop directional isolate
	 R	     right-to-left
	 RLE	     right-to-left embedding
	 RLI	     right-to-left isolate
	 RLO	     right-to-left override
	 S	     segment separator
	 WS	     white space

       As  in  all property specifications, an equals sign may be used instead
       of a colon and the class	names are  case-insensitive.  Only  the	 short
       names  listed  above  are recognized; PCRE2 does	not at present support
       any long	alternatives.

   Extended grapheme clusters

       The \X escape matches any number	of Unicode  characters	that  form  an
       "extended grapheme cluster", and	treats the sequence as an atomic group
       (see  below).  Unicode supports various kinds of	composite character by
       giving each character a grapheme	breaking property,  and	 having	 rules
       that use	these properties to define the boundaries of extended grapheme
       clusters.  The rules are	defined	in Unicode Standard Annex 29, "Unicode
       Text Segmentation". Unicode 11.0.0 abandoned the	use of	some  previous
       properties  that	had been used for emojis.  Instead it introduced vari-
       ous emoji-specific properties. PCRE2  uses  only	 the  Extended	Picto-
       graphic property.

       \X  always  matches  at least one character. Then it decides whether to
       add additional characters according to the following rules for ending a
       cluster:

       1. End at the end of the	subject	string.

       2. Do not end between CR	and LF;	otherwise end after any	control	 char-
       acter.

       3.  Do  not  break  Hangul (a Korean script) syllable sequences.	Hangul
       characters are of five types: L,	V, T, LV, and LVT. An L	character  may
       be  followed by an L, V,	LV, or LVT character; an LV or V character may
       be followed by a	V or T character; an LVT or T character	 may  be  fol-
       lowed only by a T character.

       4. Do not end before extending characters or spacing marks or the zero-
       width  joiner  (ZWJ) character. Characters with the "mark" property al-
       ways have the "extend" grapheme breaking	property.

       5. Do not end after prepend characters.

       6. Do not end within emoji modifier sequences or	emoji ZWJ  (zero-width
       joiner)	sequences.  An emoji ZWJ sequence consists of a	character with
       the Extended_Pictographic property, optionally followed by one or  more
       characters  with	 the  Extend  property,	followed by the	ZWJ character,
       followed	by another Extended_Pictographic character.

       7. Do not break within emoji flag sequences. That is, do	not break  be-
       tween  regional indicator (RI) characters if there are an odd number of
       RI characters before the	break point.

       8. Otherwise, end the cluster.

   PCRE2's additional properties

       As well as the standard Unicode properties described above, PCRE2  sup-
       ports four more that make it possible to	convert	traditional escape se-
       quences	such  as \w and	\s to use Unicode properties. PCRE2 uses these
       non-standard, non-Perl properties internally  when  PCRE2_UCP  is  set.
       However,	they may also be used explicitly. These	properties are:

	 Xan   Any alphanumeric	character
	 Xps   Any POSIX space character
	 Xsp   Any Perl	space character
	 Xwd   Any Perl	"word" character

       Xan  matches  characters	that have either the L (letter)	or the N (num-
       ber) property. Xps matches the characters tab, linefeed,	vertical  tab,
       form  feed,  or carriage	return,	and any	other character	that has the Z
       (separator) property (this includes the space character).  Xsp  is  the
       same as Xps; in PCRE1 it	used to	exclude	vertical tab, for Perl compat-
       ibility,	but Perl changed. Xwd matches the same characters as Xan, plus
       those  that  match  Mn (non-spacing mark) or Pc (connector punctuation,
       which includes underscore).

       There is	another	non-standard property, Xuc, which matches any  charac-
       ter  that  can  be represented by a Universal Character Name in C++ and
       other programming languages. These are the characters $,	 @,  `	(grave
       accent),	 and  all  characters with Unicode code	points greater than or
       equal to	U+00A0,	except for the surrogates U+D800 to U+DFFF. Note  that
       most  base  (ASCII) characters are excluded. (Universal Character Names
       are of the form \uHHHH or \UHHHHHHHH where H is	a  hexadecimal	digit.
       Note that the Xuc property does not match these sequences but the char-
       acters that they	represent.)

   Resetting the match start

       In  normal  use,	 the  escape sequence \K causes	any previously matched
       characters not to be included in	the final matched sequence that	is re-
       turned. For example, the	pattern:

	 foo\Kbar

       matches "foobar", but reports that it has matched "bar".	 \K  does  not
       interact	with anchoring in any way. The pattern:

	 ^foo\Kbar

       matches	only  when  the	 subject  begins with "foobar" (in single line
       mode), though it	again reports the matched string as "bar".  This  fea-
       ture  is	 similar  to a lookbehind assertion (described below), but the
       part of the pattern that	precedes \K is not constrained to match	a lim-
       ited number of characters, as is	required for a	lookbehind  assertion.
       The  use	 of  \K	 does  not interfere with the setting of captured sub-
       strings.	 For example, when the pattern

	 (foo)\Kbar

       matches "foobar", the first substring is	still set to "foo".

       From version 5.32.0 Perl	forbids	the use	of  \K	in  lookaround	asser-
       tions.  From release 10.38 PCRE2	also forbids this by default. However,
       the PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK option can be  used	 when  calling
       pcre2_compile()	to  re-enable the previous behaviour. When this	option
       is set, \K is acted upon	when it	occurs inside positive assertions, but
       is ignored in negative assertions. Note that when  a  pattern  such  as
       (?=ab\K)	 matches,  the reported	start of the match can be greater than
       the end of the match. Using \K in a lookbehind assertion	at  the	 start
       of  a  pattern can also lead to odd effects. For	example, consider this
       pattern:

	 (?<=\Kfoo)bar

       If the subject is "foobar", a call to  pcre2_match()  with  a  starting
       offset  of 3 succeeds and reports the matching string as	"foobar", that
       is, the start of	the reported match is earlier  than  where  the	 match
       started.

   Simple assertions

       The  final use of backslash is for certain simple assertions. An	asser-
       tion specifies a	condition that has to be met at	a particular point  in
       a  match, without consuming any characters from the subject string. The
       use of groups for more complicated assertions is	described below.   The
       backslashed assertions are:

	 \b	matches	at a word boundary
	 \B	matches	when not at a word boundary
	 \A	matches	at the start of	the subject
	 \Z	matches	at the end of the subject
		 also matches before a newline at the end of the subject
	 \z	matches	only at	the end	of the subject
	 \G	matches	at the first matching position in the subject

       Inside  a  character  class, \b has a different meaning;	it matches the
       backspace character. If any other of  these  assertions	appears	 in  a
       character class,	an "invalid escape sequence" error is generated.

       A  word	boundary is a position in the subject string where the current
       character and the previous character do not both	match \w or  \W	 (i.e.
       one  matches  \w	 and the other matches \W), or the start or end	of the
       string if the first or last character matches  \w,  respectively.  When
       PCRE2  is  built	with Unicode support, the meanings of \w and \W	can be
       changed by setting the PCRE2_UCP	option.	When this is done, it also af-
       fects \b	and \B.	Neither	PCRE2 nor Perl has a separate "start of	 word"
       or  "end	 of  word" metasequence. However, whatever follows \b normally
       determines which	it is. For example, the	fragment \ba  matches  "a"  at
       the start of a word.
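
       The word boundary definition can be expressed directly in code. The
       following Python sketch assumes the default (non-UCP, non-locale)
       meaning of \w, that is, an ASCII letter, an ASCII digit, or an
       underscore; the function names is_word and at_word_boundary are
       hypothetical.

```python
def is_word(ch: str) -> bool:
    # Default \w: underscore or an ASCII letter or digit.
    return ch == "_" or (ch.isascii() and ch.isalnum())


def at_word_boundary(subject: str, pos: int) -> bool:
    # \b is true when exactly one of the two neighbouring characters
    # matches \w. A missing neighbour (start or end of the subject)
    # counts as a non-word character.
    before = pos > 0 and is_word(subject[pos - 1])
    after = pos < len(subject) and is_word(subject[pos])
    return before != after
```

       For the subject "a cat" this model reports boundaries at offsets 0,
       1, 2, and 5; \B would be true at the remaining offsets 3 and 4.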

       The  \A,	 \Z,  and \z assertions	differ from the	traditional circumflex
       and dollar (described in	the next section) in that they only ever match
       at the very start and end of the	subject	string,	whatever  options  are
       set.  Thus,  they are independent of multiline mode. These three	asser-
       tions are not affected by the  PCRE2_NOTBOL  or	PCRE2_NOTEOL  options,
       which  affect only the behaviour	of the circumflex and dollar metachar-
       acters. However,	if the startoffset argument of pcre2_match()  is  non-
       zero,  indicating  that	matching is to start at	a point	other than the
       beginning of the	subject, \A can	never match.  The  difference  between
       \Z  and \z is that \Z matches before a newline at the end of the	string
       as well as at the very end, whereas \z matches only at the end.

       The \G assertion	is true	only when the current matching position	is  at
       the  start point	of the matching	process, as specified by the startoff-
       set argument of pcre2_match(). It differs from \A  when	the  value  of
       startoffset  is	non-zero. By calling pcre2_match() multiple times with
       appropriate arguments, you can mimic Perl's /g option, and it is in
       this kind of implementation that \G can be useful.

       Note,  however,	that  PCRE2's  implementation of \G, being true	at the
       starting	character of the matching process, is  subtly  different  from
       Perl's,	which  defines it as true at the end of	the previous match. In
       Perl, these can be different when the  previously  matched  string  was
       empty. Because PCRE2 does just one match	at a time, it cannot reproduce
       this behaviour.

       If  all	the alternatives of a pattern begin with \G, the expression is
       anchored	to the starting	match position,	and the	"anchored" flag	is set
       in the compiled regular expression.

CIRCUMFLEX AND DOLLAR

       The circumflex and dollar  metacharacters  are  zero-width  assertions.
       That  is,  they test for	a particular condition being true without con-
       suming any characters from the subject string. These two	metacharacters
       are concerned with matching the starts and ends of lines. If  the  new-
       line  convention	is set so that only the	two-character sequence CRLF is
       recognized as a newline,	isolated CR and	LF characters are  treated  as
       ordinary	data characters, and are not recognized	as newlines.

       Outside a character class, in the default matching mode,	the circumflex
       character  is  an  assertion  that is true only if the current matching
       point is	at the start of	the subject string. If the  startoffset	 argu-
       ment  of	 pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum-
       flex can	never match if the PCRE2_MULTILINE option is unset.  Inside  a
       character  class, circumflex has	an entirely different meaning (see be-
       low).

       Circumflex need not be the first	character of the pattern if  a	number
       of  alternatives	are involved, but it should be the first thing in each
       alternative in which it appears if the pattern is ever  to  match  that
       branch.	If all possible	alternatives start with	a circumflex, that is,
       if the pattern is constrained to	match only at the start	 of  the  sub-
       ject,  it  is  said  to be an "anchored"	pattern. (There	are also other
       constructs that can cause a pattern to be anchored.)

       The dollar character is an assertion that is true only if  the  current
       matching	 point is at the end of	the subject string, or immediately be-
       fore a newline at the end of the	string (by default), unless  PCRE2_NO-
       TEOL  is	 set.  Note, however, that it does not actually	match the new-
       line. Dollar need not be	the last character of the pattern if a	number
       of  alternatives	 are  involved,	 but it	should be the last item	in any
       branch in which it appears. Dollar has no special meaning in a  charac-
       ter class.

       The  meaning  of	 dollar	 can be	changed	so that	it matches only	at the
       very end	of the string, by setting the PCRE2_DOLLAR_ENDONLY  option  at
       compile time. This does not affect the \Z assertion.

       The meanings of the circumflex and dollar metacharacters	are changed if
       the  PCRE2_MULTILINE  option  is	 set.  When this is the	case, a	dollar
       character matches before	any newlines in	the string, as well as at  the
       very  end, and a	circumflex matches immediately after internal newlines
       as well as at the start of the subject string. It does not match	 after
       a  newline  that	ends the string, for compatibility with	Perl. However,
       this can	be changed by setting the PCRE2_ALT_CIRCUMFLEX option.

       For example, the	pattern	/^abc$/	matches	the subject string  "def\nabc"
       (where  \n  represents a	newline) in multiline mode, but	not otherwise.
       Consequently, patterns that are anchored	in single  line	 mode  because
       all  branches  start  with  ^ are not anchored in multiline mode, and a
       match for circumflex is	possible  when	the  startoffset  argument  of
       pcre2_match()  is  non-zero. The	PCRE2_DOLLAR_ENDONLY option is ignored
       if PCRE2_MULTILINE is set.
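
       The rules for where circumflex and dollar can be true are summarized
       in the following Python sketch. It is a simplified model, not PCRE2
       code: it assumes that LF is the only newline, that startoffset is
       zero, and that neither PCRE2_NOTBOL nor PCRE2_NOTEOL is set. The
       function and parameter names are hypothetical stand-ins for the
       corresponding options.

```python
def circumflex_positions(subject, multiline=False, alt_circumflex=False):
    # Offsets at which ^ is true.
    positions = [0]
    if multiline:
        for i, ch in enumerate(subject):
            if ch == "\n":
                after = i + 1
                # No match after a newline that ends the string, unless
                # the PCRE2_ALT_CIRCUMFLEX behaviour is requested.
                if after < len(subject) or alt_circumflex:
                    positions.append(after)
    return positions


def dollar_positions(subject, multiline=False, dollar_endonly=False):
    # Offsets at which $ is true.
    end = len(subject)
    if multiline:
        # Before every newline and at the very end; the
        # PCRE2_DOLLAR_ENDONLY behaviour is ignored in multiline mode.
        return [i for i, ch in enumerate(subject) if ch == "\n"] + [end]
    positions = []
    if not dollar_endonly and subject.endswith("\n"):
        positions.append(end - 1)   # just before a final newline
    positions.append(end)
    return positions
```

       With the subject "def\nabc", circumflex is true only at offset 0 in
       single line mode, but also at offset 4 in multiline mode, which is
       why /^abc$/ matches there only when multiline mode is on.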

       When the newline convention (see "Newline conventions" above) recognizes
       the two-character sequence CRLF as a newline, this is preferred,
       even if the single characters CR	and LF are  also  recognized  as  new-
       lines.  For  example,  if  the newline convention is "any", a multiline
       mode circumflex matches before "xyz" in the string "abc\r\nxyz"	rather
       than  after  CR,	even though CR on its own is a valid newline. (It also
       matches at the very start of the	string,	of course.)

       Note that the sequences \A, \Z, and \z can be used to match  the	 start
       and  end	of the subject in both modes, and if all branches of a pattern
       start with \A it	is always anchored, whether or not PCRE2_MULTILINE  is
       set.

FULL STOP (PERIOD, DOT)	AND \N

       Outside a character class, a dot	in the pattern matches any one charac-
       ter  in	the subject string except (by default) a character that	signi-
       fies the	end of a line. One or more characters may be specified as line
       terminators (see	"Newline conventions" above).

       Dot never matches a single line-ending character. When the  two-charac-
       ter  sequence CRLF is the only line ending, dot does not	match CR if it
       is immediately followed by LF, but otherwise it matches all  characters
       (including  isolated  CRs  and  LFs). When ANYCRLF is selected for line
       endings, no occurrences of CR or LF match dot. When all Unicode line
       endings are being recognized, dot does not match	CR or LF or any	of the
       other line ending characters.

       The  behaviour  of  dot	with regard to newlines	can be changed.	If the
       PCRE2_DOTALL option is set, a dot matches any  one  character,  without
       exception.   If	the two-character sequence CRLF	is present in the sub-
       ject string, it takes two dots to match it.

       The handling of dot is entirely independent of the handling of  circum-
       flex  and  dollar,  the	only relationship being	that they both involve
       newlines. Dot has no special meaning in a character class.

       The escape sequence \N when not followed	by an  opening	brace  behaves
       like  a dot, except that	it is not affected by the PCRE2_DOTALL option.
       In other	words, it matches any character	except one that	signifies  the
       end of a	line.

       When \N is followed by an opening brace it has a	different meaning. See
       the  section entitled "Non-printing characters" above for details. Perl
       also uses \N{name} to specify characters	by Unicode  name;  PCRE2  does
       not support this.

MATCHING A SINGLE CODE UNIT

       Outside	a character class, the escape sequence \C matches any one code
       unit, whether or	not a UTF mode is set. In the 8-bit library, one  code
       unit  is	 one  byte;  in	the 16-bit library it is a 16-bit unit;	in the
       32-bit library it is a 32-bit unit. Unlike a  dot,  \C  always  matches
       line-ending  characters.	 The  feature  is provided in Perl in order to
       match individual	bytes in UTF-8 mode, but it is unclear how it can use-
       fully be	used.

       Because \C breaks up characters into individual	code  units,  matching
       one  unit  with	\C  in UTF-8 or	UTF-16 mode means that the rest	of the
       string may start	with a malformed UTF character.	This has undefined re-
       sults, because PCRE2 assumes that it is matching	character by character
       in a valid UTF string (by default it checks the subject string's	valid-
       ity at  the  start  of  processing  unless  the	PCRE2_NO_UTF_CHECK  or
       PCRE2_MATCH_INVALID_UTF option is used).

       An   application	  can	lock   out  the	 use  of  \C  by  setting  the
       PCRE2_NEVER_BACKSLASH_C option when compiling a	pattern.  It  is  also
       possible	to build PCRE2 with the	use of \C permanently disabled.

       PCRE2  does  not	allow \C to appear in lookbehind assertions (described
       below) in UTF-8 or UTF-16 modes,	because	this would make	it  impossible
       to  calculate  the  length  of  the lookbehind. Neither the alternative
       matching function pcre2_dfa_match() nor the JIT optimizer supports \C in
       these UTF modes.	 The former gives a match-time error; the latter fails
       to optimize and so the match is always run using	the interpreter.

       In the 32-bit library, however, \C is always supported  (when  not  ex-
       plicitly	 locked	 out)  because	it  always matches a single code unit,
       whether or not UTF-32 is	specified.

       In general, the \C escape sequence is best avoided. However, one	way of
       using it	that avoids the	problem	of malformed UTF-8 or  UTF-16  charac-
       ters  is	 to use	a lookahead to check the length	of the next character,
       as in this pattern, which could be used with  a	UTF-8  string  (ignore
       white space and line breaks):

	 (?| (?=[\x00-\x7f])(\C) |
	     (?=[\x80-\x{7ff}])(\C)(\C)	|
	     (?=[\x{800}-\x{ffff}])(\C)(\C)(\C)	|
	     (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))

       In  this	 example,  a  group  that starts with (?| resets the capturing
       parentheses numbers in each alternative (see "Duplicate Group  Numbers"
       below). The assertions at the start of each branch check	the next UTF-8
       character  for  values whose encoding uses 1, 2,	3, or 4	bytes, respec-
       tively. The character's individual bytes	are then captured by  the  ap-
       propriate number	of \C groups.
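
       The byte counts that the four lookaheads test for can be checked
       outside PCRE2. The following is a minimal Python sketch, used only
       for illustration; utf8_length is a hypothetical helper and not part
       of any PCRE2 API. It encodes the boundary code points of each range.
       The upper bound is U+10FFFF, the highest code point Python can encode,
       rather than the \x{1fffff} written in the pattern.

```python
def utf8_length(code_point):
    """Number of UTF-8 code units (bytes) used to encode one code point."""
    return len(chr(code_point).encode("utf-8"))


# Boundary code points of the four ranges tested by the lookaheads.
boundaries = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
lengths = [utf8_length(cp) for cp in boundaries]
print(lengths)
```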

SQUARE BRACKETS	AND CHARACTER CLASSES

       An opening square bracket introduces a character	class, terminated by a
       closing square bracket. A closing square	bracket	on its own is not spe-
       cial  by	 default.  If a	closing	square bracket is required as a	member
       of the class, it	should be the first data character in the class	(after
       an initial circumflex, if present) or escaped with  a  backslash.  This
       means  that,  by	default, an empty class	cannot be defined. However, if
       the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket  at
       the start does end the (empty) class.

       A  character class matches a single character in	the subject. A matched
       character must be in the	set of characters defined by the class,	unless
       the first character in the class	definition is a	circumflex,  in	 which
       case the	subject	character must not be in the set defined by the	class.
       If  a  circumflex is actually required as a member of the class,	ensure
       it is not the first character, or escape	it with	a backslash.

       For example, the	character class	[aeiou]	matches	any lower case English
       vowel, whereas [^aeiou] matches all other characters. Note that a  cir-
       cumflex	is  just  a  convenient	notation for specifying	the characters
       that are	in the class by	enumerating those that are not.	A  class  that
       starts with a circumflex	is not an assertion; it	still consumes a char-
       acter  from  the	subject	string,	and therefore it fails to match	if the
       current pointer is at the end of	the string.
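
       These rules can be tried with Python's re module, which is a
       different engine from PCRE2 but treats these simple classes the same
       way. The sketch below is illustrative only and its variable names are
       arbitrary.

```python
import re

# [aeiou] matches one lower case vowel; [^aeiou] matches one other character.
vowel = re.fullmatch(r"[aeiou]", "e")
consonant = re.fullmatch(r"[^aeiou]", "z")
vowel_rejected = re.fullmatch(r"[^aeiou]", "e")

# A negated class still has to consume a character, so it cannot match at
# the end of the subject: after "x" there is nothing left to consume.
at_end = re.search(r"x[^aeiou]", "x")
```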

       Characters in a class may be specified by their code points  using  \o,
       \x,  or \N{U+hh..} in the usual way. When caseless matching is set, any
       letters in a class represent both their upper case and lower case  ver-
       sions,  so  for example,	a caseless [aeiou] matches "A" as well as "a",
       and a caseless [^aeiou] does not	match "A", whereas a  caseful  version
       would.  Note that there are two ASCII characters, K and S, that,	in ad-
       dition to their lower case ASCII	equivalents, are case-equivalent  with
       Unicode	U+212A (Kelvin sign) and U+017F	(long S) respectively when ei-
       ther PCRE2_UTF or PCRE2_UCP is set. If you do not want these ASCII/non-
       ASCII case equivalences,	you can	suppress  them	by  setting  PCRE2_EX-
       TRA_CASELESS_RESTRICT,  either as an option in a	compile	context, or by
       including (*CASELESS_RESTRICT) or (?r) within a pattern.
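
       The caseless behaviour of positive and negated classes can be
       sketched with Python's re module. It is not PCRE2, and its
       re.IGNORECASE flag is assumed here to be a fair stand-in for
       PCRE2_CASELESS with ASCII letters.

```python
import re

# Caseless: "A" counts as a vowel, so the negated class rejects it.
caseless_vowel = re.fullmatch(r"[aeiou]", "A", re.IGNORECASE)
caseless_negated = re.fullmatch(r"[^aeiou]", "A", re.IGNORECASE)

# Caseful: "A" is not one of a, e, i, o, u, so the negated class accepts it.
caseful_negated = re.fullmatch(r"[^aeiou]", "A")
```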

       Characters that might indicate line breaks are  never  treated  in  any
       special	way  when matching character classes, whatever line-ending se-
       quence is  in  use,  and	 whatever  setting  of	the  PCRE2_DOTALL  and
       PCRE2_MULTILINE	options	 is  used. A class such	as [^a]	always matches
       one of these characters.
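
       A short check of this rule, using Python's re module as a stand-in
       (a different engine whose classes behave the same way here): a
       negated class matches a newline, while dot does not unless its
       "dotall" flag is set.

```python
import re

newline_in_class = re.fullmatch(r"[^a]", "\n")
dot_default = re.fullmatch(r".", "\n")
dot_with_dotall = re.fullmatch(r".", "\n", re.DOTALL)
```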

       The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
       \S, \v, \V, \w, and \W may appear in a character	 class,	 and  add  the
       characters  that	 they  match  to  the  class.  For example, [\dABCDEF]
       matches any hexadecimal digit. In UTF modes, the	PCRE2_UCP  option  af-
       fects the meanings of \d, \s, \w	and their upper	case partners, just as
       it does when they appear	outside	a character class, as described	in the
       section	entitled  "Generic character types" above. The escape sequence
       \b has a	different meaning inside a character  class;  it  matches  the
       backspace  character.  The sequences \B,	\R, and	\X are not special in-
       side a character	class. Like any	other unrecognized  escape  sequences,
       they  cause  an	error. The same	is true	for \N when not	followed by an
       opening brace.
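
       The hexadecimal-digit class and the in-class meaning of \b can be
       tried in Python's re module. This is an illustration with a different
       engine, which shares these two behaviours.

```python
import re

hex_digit = re.compile(r"[\dABCDEF]")
results = {ch: bool(hex_digit.fullmatch(ch)) for ch in "7CG"}

# Inside a class, \b is the backspace character (code point 8), not a
# word-boundary assertion.
backspace = re.fullmatch(r"[\b]", "\x08")
```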

       The minus (hyphen) character can	be used	to specify a range of  charac-
       ters  in	 a  character class. For example, [d-m]	matches	any letter be-
       tween d and m, inclusive. If a minus character is required in a	class,
       it  must	 be  escaped with a backslash or appear	in a position where it
       cannot be interpreted as	indicating a range, typically as the first  or
       last character in the class, or immediately after a range. For example,
       [b-d-z] matches letters in the range b to d, a hyphen character,	or z.
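
       Both examples from this paragraph can be exercised with Python's re
       module. It is not PCRE2, but it parses these two classes the same
       way.

```python
import re

d_to_m = re.compile(r"[d-m]")
mixed = re.compile(r"[b-d-z]")  # range b-d, then a literal hyphen, then z

# Keep the characters of the sample string that the mixed class accepts.
matched = [ch for ch in "abcd-mz" if mixed.fullmatch(ch)]
print(matched)
```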

       There  is  some special treatment for alphabetic	ranges in EBCDIC envi-
       ronments; see the section "EBCDIC environments" below.

       Perl treats a hyphen as a literal if it appears before or after a POSIX
       class (see below) or before or after a character	type escape such as \d
       or \H.  However,	unless the hyphen is the last character	in the	class,
       Perl  outputs  a	 warning in its	warning	mode, as this is most likely a
       user error. As PCRE2 has	no facility for	warning, an error is given  in
       these cases.

       It is not possible to have the literal character	"]" as the end charac-
       ter  of a range.	A pattern such as [W-]46] is interpreted as a class of
       two characters ("W" and "-") followed by	a literal string "46]",	so  it
       would  match  "W46]"  or	 "-46]". However, if the "]" is	escaped	with a
       backslash it is interpreted as the end of a range, so [W-\]46]  is  in-
       terpreted  as  a	class containing a range and two other characters. The
       octal or	hexadecimal representation of "]" can also be used  to	end  a
       range.
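
       The difference between the two patterns can be seen in a small
       sketch. Python's re module is used as a stand-in; it is a separate
       engine that happens to read these two classes as described above.

```python
import re

# "]" directly after "W-" closes the class: class {W, -}, then literal "46]".
unescaped = re.compile(r"[W-]46]")

# An escaped "]" ends a range instead: class {W..], 4, 6}.
escaped = re.compile(r"[W-\]46]")

in_escaped_class = [ch for ch in "WXYZ[\\]46-" if escaped.fullmatch(ch)]
```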

       Ranges normally include all code	points between the start and end char-
       acters,	inclusive. They	can also be used for code points specified nu-
       merically, for example [\000-\037]. Ranges can include  any  characters
       that  are  valid	 for  the current mode.	In any UTF mode, the so-called
       "surrogate" characters (those whose code	points lie between 0xd800  and
       0xdfff  inclusive)  may	not  be	 specified  explicitly by default (the
       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables this	 check).  How-
       ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
       are always permitted.

       If a range that includes	letters	is used	when caseless matching is set,
       it matches the letters in either	case. For example, [W-c] is equivalent
       to  [][\\^_`wxyzabc],  matched  caselessly,  and	 in a non-UTF mode, if
       character tables	for a French locale are	in  use,  [\xc8-\xcb]  matches
       accented	E characters in	both cases.
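
       The [W-c] equivalence can be checked character by character. The
       sketch uses Python's re module with re.IGNORECASE, which is assumed
       to behave like PCRE2's caseless matching for this ASCII range.

```python
import re

caseless_range = re.compile(r"[W-c]", re.IGNORECASE)

# Members of the equivalent class given in the text, plus upper case forms.
listed = "][\\^_`wxyzabc"
upper_forms = "WXYZABC"

all_listed = all(caseless_range.fullmatch(ch) for ch in listed)
all_upper = all(caseless_range.fullmatch(ch) for ch in upper_forms)
outside = [ch for ch in "Vd" if caseless_range.fullmatch(ch)]
```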

       A  circumflex  can  conveniently	 be used with the upper	case character
       types to	specify	a more restricted set of characters than the  matching
       lower  case  type.  For example,	the class [^\W_] matches any letter or
       digit, but not underscore, whereas [\w] includes	underscore. A positive
       character class should be read as "something OR something OR ..." and a
       negative	class as "NOT something	AND NOT	something AND NOT ...".
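
       The "NOT something AND NOT something" reading of [^\W_] can be seen
       in a quick sketch with Python's re module. It is a different engine,
       but it supports the same construct.

```python
import re

# NOT a non-word character AND NOT an underscore: letters and digits only.
alnum_only = re.compile(r"[^\W_]")
results = {ch: bool(alnum_only.fullmatch(ch)) for ch in "a5_-"}

# The positive class [\w] does include the underscore.
word_class = re.fullmatch(r"[\w]", "_")
```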

       The metacharacters that are recognized in character classes  are	 back-
       slash,  hyphen (when it can be interpreted as specifying	a range), cir-
       cumflex (only  at  the  start),	and  the  terminating  closing	square
       bracket.	 An  opening square bracket is also special when it can	be in-
       terpreted as introducing	a POSIX	class (see "Posix  character  classes"
       below),	or a special compatibility feature (see	"Compatibility feature
       for word boundaries" below). Escaping any non-alphanumeric character in
       a class turns it	into a literal,	whether	or not it would	otherwise be a
       metacharacter.

PERL EXTENDED CHARACTER	CLASSES

       From  release  10.45  PCRE2 supports Perl's (?[...]) extended character
       class syntax. This can be used to perform set operations	such as	inter-
       section on character classes.

       The syntax permitted within (?[...]) is	quite  different  to  ordinary
       character  classes.  Inside  the	extended class,	there is an expression
       syntax consisting of "atoms", operators,	and ordinary parentheses  "()"
       used  for  grouping.  Such  classes  always  have the Perl /xx modifier
       (PCRE2 option PCRE2_EXTENDED_MORE) turned on within  them.  This	 means
       that  literal  space  and  tab characters are ignored everywhere	in the
       class.

       The allowed atoms are individual	characters  specified  by  escape  se-
       quences	such  as  \n  or  \x{123},  character  types such as \d, POSIX
       classes such as [:alpha:], and nested ordinary (non-extended) character
       classes.	For example, in	(?[\d &	[...]])	the nested class [...] follows
       the usual rules for ordinary character classes,	in  which  parentheses
       are  not	 metacharacters, and character literals	and ranges are permit-
       ted.

       Character literals and ranges may not appear outside a nested  ordinary
       character  class	because	they are not atoms in the extended syntax. The
       extended	syntax does not	introduce any additional escape	sequences,  so
       (?[\y]) is an unknown escape, as	it would be in [\y].

       In the extended syntax, ^ does not negate a class (except within	an or-
       dinary  class  nested inside an extended	class);	it is instead a	binary
       operator.

       The binary operators are	"&" (intersection), "|"	or  "+"	 (union),  "-"
       (subtraction)  and  "^" (symmetric difference). These are left-associa-
       tive and	"&" has	higher (tighter) precedence,  while  the  others  have
       equal  lower  precedence. The one prefix	unary operator is "!" (comple-
       ment), with highest precedence.
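
       The operators correspond to ordinary set operations. The sketch below
       models them with Python sets of ASCII characters. It is an analogy,
       not PCRE2 code, and the (?[...]) forms in the comments are
       illustrative. Python's own operator precedence differs from the rules
       above, so the grouping is written out with explicit parentheses.

```python
import string

lower = set(string.ascii_lowercase)
digits = set(string.digits)
hexdigits = set(string.hexdigits)

# One Python set operator per extended-class binary operator.
intersection = lower & hexdigits      # like (?[ [a-z] & [:xdigit:] ])
union = digits | set("abc")           # like (?[ \d | [abc] ])
subtraction = digits - set("89")      # like (?[ \d - [89] ])
symmetric = set("abc") ^ set("bcd")   # like (?[ [abc] ^ [bcd] ])

# "|" and "-" have equal precedence and are left-associative, so
# (?[ \d | [a-f] - [5-9a] ]) groups as ((\d | [a-f]) - [5-9a]).
grouped = (digits | set("abcdef")) - set("56789a")
```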

UTS#18 EXTENDED	CHARACTER CLASSES

       The PCRE2_ALT_EXTENDED_CLASS option enables an  alternative  to	Perl's
       (?[...])	  syntax, allowing instead extended class behaviour inside or-
       dinary [...]  character classes.	This altered syntax for	[...]  classes
       is  loosely described by	the Unicode standard UTS#18. The PCRE2_ALT_EX-
       TENDED_CLASS option does	not prevent use	of (?[...]) classes;  it  just
       changes	the  meaning of	all [...] classes that are not nested inside a
       Perl (?[...]) class.

       Firstly,	in ordinary Perl [...] syntax, an expression such as "[a[]" is
       a character class with two literal  characters  "a"  and	 "[",  but  in
       UTS#18  extended	 classes  the  "["  character  becomes	an  additional
       metacharacter within classes, denoting the start	of a nested class,  so
       a literal "[" must be escaped as	"\[".

       Secondly,  within the UTS#18 extended syntax, there are operators "||",
       "&&", "--" and "~~" which denote	character class	 union,	 intersection,
       subtraction,  and  symmetric  difference	respectively. In standard Perl
       syntax, these would simply be needlessly-repeated literals (except  for
       "--"  which  could  be the start	or end of a range). In UTS#18 extended
       classes these operators can be used in constructs such as [\p{L}--[QW]]
       for "Unicode letters, other than	Q and W".  A literal "-" at the	 start
       or  end	of a range must	be escaped, so while "[--1]" in	Perl syntax is
       the range from hyphen to	"1", it	must be	escaped	as "[\--1]" in	UTS#18
       extended	classes.

       Unlike Perl's (?[...]) extended classes,	the PCRE2_EXTENDED_MORE	option
       to  ignore  space  and  tab characters is not automatically enabled for
       UTS#18 extended classes,	but it is honoured if set.

       Extended	UTS#18 classes can be nested, and  nested  classes  are	 them-
       selves extended classes (unlike Perl, where nested classes must be sim-
       ple  classes).  For example, [\p{L}&&[\p{Thai}||\p{Greek}]] matches any
       letter that is in the Thai or Greek scripts. Note that this means  that
       no  special grouping characters (such as	the parentheses	used in	Perl's
       (?[...])	class syntax) are needed.

       Individual class	items (literal characters, literal ranges,  properties
       such  as	\d or \p{...}, and nested classes) can be combined by juxtapo-
       sition or by an operator. Juxtaposition is the implicit union operator,
       and binds more tightly than any explicit	operator. Thus a  sequence  of
       literals	and/or ranges behaves as if it is enclosed in square brackets.
       For  example,  [A-Z0-9&&[^E8]]  is the same as [[A-Z0-9]&&[^E8]], which
       matches any upper case alphanumeric character except "E"	or "8".

       Precedence between the explicit operators is not	defined, so mixing op-
       erators is a syntax error. For example,	[A&&B--C]  is  an  error,  but
       [A&&[B--C]] is valid.

       This  is	an emerging syntax which is being adopted gradually across the
       regex ecosystem:	for example JavaScript adopted the "/v"	 flag  in  EC-
       MAScript	 2024; Python's	"re" module reserves the syntax	for future use
       with a FutureWarning for	unescaped use of "[" as	a literal within char-
       acter classes. Due to UTS#18 providing insufficient  guidance,  engines
       interpret the syntax differently. Rust's "regex" crate and Python's
       "regex" PyPI module both implement UTS#18 extended classes, but with
       slight	incompatibilities  ([A||B&&C]  is  parsed  as  [A||[B&&C]]  in
       Python's	"regex"	but as [[A||B]&&C] in Rust's "regex").

       PCRE2's syntax adds syntax restrictions similar to ECMAScript's /v
       flag,  so  that	all  the  UTS#18 extended classes accepted as valid by
       PCRE2 have the property that they are interpreted either	with the  same
       behaviour,  or  as  invalid, by all other major engines.	Please file an
       issue if	you are	aware of cross-engine differences in behaviour between
       PCRE2 and another major engine.

POSIX CHARACTER	CLASSES

       Perl supports the POSIX notation	for character classes. This uses names
       enclosed	by [: and :] within the	enclosing square brackets. PCRE2  also
       supports	 this notation,	in both	ordinary and extended classes. For ex-
       ample,

	 [01[:alpha:]%]

       matches "0", "1", any alphabetic	character, or "%". The supported class
       names are:

	 alnum	  letters and digits
	 alpha	  letters
	 ascii	  character codes 0 - 127
	 blank	  space	or tab only
	 cntrl	  control characters
	 digit	  decimal digits (same as \d)
	 graph	  printing characters, excluding space
	 lower	  lower	case letters
	 print	  printing characters, including space
	 punct	  printing characters, excluding letters and digits and	space
	 space	  white	space (the same	as \s from PCRE1 8.34)
	 upper	  upper	case letters
	 word	  "word" characters (same as \w)
	 xdigit	  hexadecimal digits

       The default "space" characters are HT (9), LF (10), VT (11),  FF	 (12),
       CR  (13),  and space (32). If locale-specific matching is taking	place,
       the list	of space characters may	be different; there may	 be  fewer  or
       more  of	 them.	"Space"	and \s match the same set of characters, as do
       "word" and \w.

       The name	"word" is a Perl extension, and	"blank"	 is  a	GNU  extension
       from  Perl  5.8.	Another	Perl extension is negation, which is indicated
       by a ^ character	after the colon. For example,

	 [12[:^digit:]]

       matches "1", "2", or any	non-digit. PCRE2 (and Perl) also recognize the
       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
       these are not supported,	and an error is	given if they are encountered.
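
       As a cross-check of the default "space" set listed earlier in this
       section, the sketch below compares it with two Python equivalents.
       Python's re module does not support the [:name:] notation, so this
       only confirms the six code points, under the assumption that
       Python's ASCII-mode \s is a fair analogue of the default (non-UCP,
       non-locale) behaviour.

```python
import re
import string

# Default "space" set from the text: HT, LF, VT, FF, CR, and space.
default_space = {9, 10, 11, 12, 13, 32}

python_whitespace = {ord(ch) for ch in string.whitespace}
ascii_backslash_s = {cp for cp in range(128)
                     if re.fullmatch(r"\s", chr(cp), re.ASCII)}
```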

       By default, characters with values greater than 127 do not match	any of
       the POSIX character classes, although this may be different for charac-
       ters in the range 128-255 when locale-specific matching	is  happening.
       However,	 in UCP	mode, unless certain options are set (see below), some
       of the classes are changed so that  Unicode  character  properties  are
       used. This is achieved by replacing POSIX classes with other sequences,
       as follows:

	 [:alnum:]  becomes  \p{Xan}
	 [:alpha:]  becomes  \p{L}
	 [:blank:]  becomes  \h
	 [:cntrl:]  becomes  \p{Cc}
	 [:digit:]  becomes  \p{Nd}
	 [:lower:]  becomes  \p{Ll}
	 [:space:]  becomes  \p{Xps}
	 [:upper:]  becomes  \p{Lu}
	 [:word:]   becomes  \p{Xwd}

       Negated	versions,  such	as [:^alpha:] use \P instead of	\p. Four other
       POSIX classes are handled specially in UCP mode:

       [:graph:] This matches characters that have glyphs that mark  the  page
		 when printed. In Unicode property terms, it matches all char-
		 acters	with the L, M, N, P, S,	or Cf properties, except for:

		   U+061C	    Arabic Letter Mark
		   U+180E	    Mongolian Vowel Separator
		   U+2066 - U+2069  Various "isolate"s

       [:print:] This  matches	the  same  characters  as [:graph:] plus space
		 characters that are not controls, that	 is,  characters  with
		 the Zs	property.

       [:punct:] This matches all characters that have the Unicode P (punctua-
		 tion)	property,  plus	those characters with code points less
		 than 256 that have the	S (Symbol) property.

       [:xdigit:]
		 In addition  to  the  ASCII  hexadecimal  digits,  this  also
		 matches  the  "fullwidth" versions of those characters, whose
		 Unicode code points start at U+FF10. This is  a  change  that
		 was made in PCRE2 release 10.43 for Perl compatibility.

       The  other  POSIX  classes  are	unchanged by PCRE2_UCP,	and match only
       characters with code points less	than 256.

       There are two options that can be used to restrict the POSIX classes to
       ASCII  characters  when	PCRE2_UCP  is  set.   The   option   PCRE2_EX-
       TRA_ASCII_DIGIT	affects	 just  [:digit:] and [:xdigit:]. Within	a pat-
       tern, this can be set and unset by  (?aT)  and  (?-aT).	The  PCRE2_EX-
       TRA_ASCII_POSIX	option	disables UCP processing	for all	POSIX classes,
       including [:digit:] and [:xdigit:]. Within a pattern, (?aP) and	(?-aP)
       set and unset both these	options	for consistency.

COMPATIBILITY FEATURE FOR WORD BOUNDARIES

       In  the POSIX.2 compliant library that was included in 4.4BSD Unix, the
       ugly syntax [[:<:]] and [[:>:]] is used for matching  "start  of	 word"
       and "end	of word". PCRE2	treats these items as follows:

	 [[:<:]]  is converted to  \b(?=\w)
	 [[:>:]]  is converted to  \b(?<=\w)

       Only these exact character sequences are recognized. A sequence such as
       [a[:<:]b] provokes an error for an unrecognized POSIX class name. This
       support is not compatible with Perl. It is provided to help  migrations
       from other environments,	and is best not	used in	any new	patterns. Note
       that  \b	matches	at the start and the end of a word (see	"Simple	asser-
       tions" above), and in a Perl-style pattern the preceding	 or  following
       character  normally shows which is wanted, without the need for the as-
       sertions	that are used above in order to	give exactly the POSIX	behav-
       iour.  Note  also  that	the PCRE2_UCP option changes the meaning of \w
       (and therefore \b) by default, so  it  also  affects  these  POSIX  se-
       quences.
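
       The two converted forms can be tried directly, since they use only \b
       and lookaround assertions. The sketch uses Python's re module, which
       is a different engine but supports these assertions, and lists the
       offsets at which each form matches in a sample subject.

```python
import re

subject = "the cat"

# \b(?=\w): a word boundary followed by a word character (start of word).
starts = [m.start() for m in re.finditer(r"\b(?=\w)", subject)]

# \b(?<=\w): a word boundary preceded by a word character (end of word).
ends = [m.start() for m in re.finditer(r"\b(?<=\w)", subject)]

print(starts, ends)
```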

VERTICAL BAR

       Vertical	 bar characters	are used to separate alternative patterns. For
       example,	the pattern

	 gilbert|sullivan

       matches either "gilbert"	or "sullivan". Any number of alternatives  may
       appear,	and  an	 empty	alternative  is	 permitted (matching the empty
       string).	The matching process tries each	alternative in turn, from left
       to right, and the first one that	succeeds is used. If the  alternatives
       are  within a group (defined below), "succeeds" means matching the rest
       of the main pattern as well as the alternative in the group.
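
       The left-to-right rule, and the meaning of "succeeds" inside a group,
       can be sketched with Python's re module. It is not PCRE2, but it uses
       the same ordered-alternation semantics.

```python
import re

subjects = ("gilbert", "sullivan", "gilbertsullivan")
either = [bool(re.fullmatch(r"gilbert|sullivan", s)) for s in subjects]

# Alternatives are tried left to right, so "a" is used before "ab" is tried.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a group, "a" alone does not let the rest of the pattern match,
# so the matcher goes on to the second alternative.
retried = re.fullmatch(r"(a|ab)c", "abc").group(1)
```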

INTERNAL OPTION	SETTING

       The settings of several options can be changed within a	pattern	 by  a
       sequence	 of  letters  enclosed between "(?" and	")". The following are
       Perl-compatible,	and are	described in detail in the pcre2api documenta-
       tion. The option	letters	are:

	 i  for	PCRE2_CASELESS
	 m  for	PCRE2_MULTILINE
	 n  for	PCRE2_NO_AUTO_CAPTURE
	 s  for	PCRE2_DOTALL
	 x  for	PCRE2_EXTENDED
	 xx for	PCRE2_EXTENDED_MORE

       For example, (?im) sets caseless, multiline matching. It	is also	possi-
       ble to unset these options by preceding the relevant letters with a hy-
       phen, for example (?-im). The two "extended" options are	 not  indepen-
       dent; unsetting either one cancels the effects of both of them.

       A   combined  setting  and  unsetting  such  as	(?im-sx),  which  sets
       PCRE2_CASELESS and PCRE2_MULTILINE  while  unsetting  PCRE2_DOTALL  and
       PCRE2_EXTENDED,	is  also  permitted. Only one hyphen may appear	in the
       options string. If a letter appears both	before and after  the  hyphen,
       the  option  is unset. An empty options setting "(?)" is	allowed. Need-
       less to say, it has no effect.

       If the first character following	(? is a	circumflex, it causes  all  of
       the  above  options  to	be unset. Letters may follow the circumflex to
       cause some options to be	re-instated, but a hyphen may not appear.

       Some PCRE2-specific options can be changed by the same mechanism	 using
       these pairs or individual letters:

	 aD for	PCRE2_EXTRA_ASCII_BSD
	 aS for	PCRE2_EXTRA_ASCII_BSS
	 aW for	PCRE2_EXTRA_ASCII_BSW
	 aP for	PCRE2_EXTRA_ASCII_POSIX	and PCRE2_EXTRA_ASCII_DIGIT
	 aT for	PCRE2_EXTRA_ASCII_DIGIT
	 r  for	PCRE2_EXTRA_CASELESS_RESTRICT
	 J  for	PCRE2_DUPNAMES
	 U  for	PCRE2_UNGREEDY

       However,	 except	for 'r', these are not unset by	(?^), which is equiva-
       lent to (?-imnrsx). If 'a' is not followed by any  of  the  upper  case
       letters shown above, it sets (or	unsets)	all the	ASCII options.

       PCRE2_EXTRA_ASCII_DIGIT	 has   no  additional  effect  when  PCRE2_EX-
       TRA_ASCII_POSIX is set, but including it	in  (?aP)  means  that	(?-aP)
       suppresses all ASCII restrictions for POSIX classes.

       When  one of these option changes occurs	at top level (that is, not in-
       side group parentheses),	the change applies until a subsequent  change,
       or  the	end of the pattern. An option change within a group (see below
       for a description of groups) affects only that part of the  group  that
       follows	it.  At	 the  end  of the group	these options are reset	to the
       state they were before the group. For example,

	 (a(?i)b)c

       matches abc and aBc and no other	strings	 (assuming  PCRE2_CASELESS  is
       not  set	 externally).  Any changes made	in one alternative do carry on
       into subsequent branches	within the same	group. For example,

	 (a(?i)b|c)

       matches "ab", "aB", "c",	and "C", even though  when  matching  "C"  the
       first  branch  is  abandoned before the option setting. This is because
       the effects of option settings happen at	compile	time. There  would  be
       some very weird behaviour otherwise.

       As  a  convenient shorthand, if any option settings are required	at the
       start of	a non-capturing	group (see the next section), the option  let-
       ters may	appear between the "?" and the ":". Thus the two patterns

	 (?i:saturday|sunday)
	 (?:(?i)saturday|sunday)

       match exactly the same set of strings.
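
       The scoped form is also accepted by Python's re module, which allows
       a rough check of the scoping rules in this section. Python is a
       different engine and rejects an option change in the middle of a
       pattern, so the first example, (a(?i)b)c, is rewritten here as
       (a(?i:b))c. That the rewrite is equivalent is an assumption of this
       sketch.

```python
import re

# Stand-in for (a(?i)b)c: only "b" is matched caselessly.
scoped = re.compile(r"(a(?i:b))c")
results = {s: bool(scoped.fullmatch(s)) for s in ("abc", "aBc", "Abc", "abC")}

# The option letters between "?" and ":" apply to every branch of the group.
weekend = re.compile(r"(?i:saturday|sunday)")
sunday = weekend.fullmatch("SUNDAY")
```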

       Note:  There  are  other	 PCRE2-specific	options, applying to the whole
       pattern,	which can be set by the	application when the  compiling	 func-
       tion  is	 called.  In addition, the pattern can contain special leading
       sequences such as (*CRLF) to override what the application has  set  or
       what  has  been	defaulted.   Details are given in the section entitled
       "Newline	sequences" above. There	are also the (*UTF) and	(*UCP) leading
       sequences that can be used to set UTF and Unicode property modes;  they
       are  equivalent to setting the PCRE2_UTF	and PCRE2_UCP options, respec-
       tively.	However,  the  application  can	 set  the  PCRE2_NEVER_UTF  or
       PCRE2_NEVER_UCP	options,  which	 lock  out  the	 use of	the (*UTF) and
       (*UCP) sequences.

GROUPS

       Groups are delimited by parentheses  (round  brackets),	which  can  be
       nested.	Turning	part of	a pattern into a group does two	things:

       1. It localizes a set of	alternatives. For example, the pattern

	 cat(aract|erpillar|)

       matches	"cataract",  "caterpillar", or "cat". Without the parentheses,
       it would	match "cataract", "erpillar" or	an empty string.

       2. It creates a "capture	group".	This means that, when the  whole  pat-
       tern  matches, the portion of the subject string	that matched the group
       is passed back to the caller, separately	from the portion that  matched
       the  whole  pattern.   (This  applies  only to the traditional matching
       function; the DFA matching function does	not support capturing.)
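
       Both effects can be seen with the pattern from item 1. The sketch
       uses Python's re module, which is a different engine but handles
       plain groups and empty alternatives in the same way.

```python
import re

grouped = re.compile(r"cat(aract|erpillar|)")
ungrouped = re.compile(r"cataract|erpillar|")

candidates = ("cataract", "caterpillar", "cat", "erpillar", "")
grouped_hits = [s for s in candidates if grouped.fullmatch(s)]
ungrouped_hits = [s for s in candidates if ungrouped.fullmatch(s)]

# The group also captures the part of the subject that it matched.
captured = grouped.fullmatch("caterpillar").group(1)
```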

       Opening parentheses are counted from left to right (starting from 1) to
       obtain numbers for capture groups. For example, if the string "the  red
       king" is	matched	against	the pattern

	 the ((red|white) (king|queen))

       the captured substrings are "red	king", "red", and "king", and are num-
       bered 1,	2, and 3, respectively.

       The  fact  that	plain  parentheses  fulfil two functions is not	always
       helpful.	 There are often times when grouping is	required without  cap-
       turing.	If an opening parenthesis is followed by a question mark and a
       colon, the group	does not do any	capturing, and	is  not	 counted  when
       computing  the number of	any subsequent capture groups. For example, if
       the string "the white queen" is matched against the pattern

	 the ((?:red|white) (king|queen))

       the captured substrings are "white queen" and "queen", and are numbered
       1 and 2.	The maximum number of capture groups is	65535.
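
       The numbering in the two examples above can be reproduced with
       Python's re module. It is not PCRE2, but it numbers capturing and
       non-capturing groups by the same left-to-right rule.

```python
import re

capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))",
                             "the white queen")

# groups() lists the captured substrings in group-number order.
print(capturing.groups())
print(non_capturing.groups())
```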

       As a convenient shorthand, if any option	settings are required  at  the
       start  of  a non-capturing group, the option letters may	appear between
       the "?" and the ":". Thus the two patterns

	 (?i:saturday|sunday)
	 (?:(?i)saturday|sunday)

       match exactly the same set of strings. Because alternative branches are
       tried from left to right, and options are not reset until  the  end  of
       the  group is reached, an option	setting	in one branch does affect sub-
       sequent branches, so the	above patterns match "SUNDAY" as well as "Sat-
       urday".

DUPLICATE GROUP	NUMBERS

       Perl 5.10 introduced a feature whereby each alternative in a group uses
       the same	numbers	for its	capturing parentheses.	Such  a	 group	starts
       with  (?|  and  is  itself a non-capturing group. For example, consider
       this pattern:

	 (?|(Sat)ur|(Sun))day

       Because the two alternatives are	inside a (?| group, both sets of  cap-
       turing  parentheses  are	 numbered one. Thus, when the pattern matches,
       you can look at captured	substring number  one,	whichever  alternative
       matched.	 This  construct  is useful when you want to capture part, but
       not all,	of one of a number of alternatives. Inside a (?| group,	paren-
       theses are numbered as usual, but the number is reset at	the  start  of
       each  branch.  The numbers of any capturing parentheses that follow the
       whole group start after the highest number used in any branch. The fol-
       lowing example is taken from the	Perl documentation. The	numbers	under-
       neath show in which buffer the captured content will be stored.

	 # before  ---------------branch-reset----------- after
	 / ( a )  (?| x	( y ) z	| (p (q) r) | (t) u (v)	) ( z )	/x
	 # 1		2	  2  3	      2	    3	  4
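
       The numbering rule can be written as a small stand-alone routine.
       The following Python sketch is hypothetical and does not show how
       PCRE2 compiles patterns. The name number_groups is invented, and the
       routine understands only "(", "(?:" and "(?|"; escapes, character
       classes, and all other "(?" constructs are ignored.

```python
def number_groups(pattern):
    """Return the capture number given to each capturing "(" in order."""
    numbers = []
    next_number = 1
    stack = []  # one frame per open group
    i = 0
    while i < len(pattern):
        if pattern.startswith("(?|", i):
            # Branch reset: remember the number each branch starts from.
            stack.append({"reset": True, "start": next_number,
                          "resume": next_number})
            i += 3
            continue
        if pattern.startswith("(?:", i):
            stack.append({"reset": False})
            i += 3
            continue
        ch = pattern[i]
        if ch == "(":
            numbers.append(next_number)
            next_number += 1
            stack.append({"reset": False})
        elif ch == "|" and stack and stack[-1]["reset"]:
            # New branch of a (?| group: restart from the first number.
            frame = stack[-1]
            frame["resume"] = max(frame["resume"], next_number)
            next_number = frame["start"]
        elif ch == ")" and stack:
            frame = stack.pop()
            if frame["reset"]:
                # Later groups continue after the highest number used.
                next_number = max(frame["resume"], next_number)
        i += 1
    return numbers


perl_example = "( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z )"
print(number_groups(perl_example))
```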

       A backreference to a capture group uses the most	recent value  that  is
       set for the group. The following	pattern	matches	"abcabc" or "defdef":

	 /(?|(abc)|(def))\1/

       In  contrast, a subroutine call to a capture group always refers	to the
       first one in the	pattern	with the given number. The  following  pattern
       matches "abcabc"	or "defabc":

	 /(?|(abc)|(def))(?1)/

       A relative reference such as (?-1) is no	different: it is just a	conve-
       nient way of computing an absolute group	number.

       If a condition test for a group's having	matched	refers to a non-unique
       number, the test	is true	if any group with that number has matched.

       An  alternative approach	to using this "branch reset" feature is	to use
       duplicate named groups, as described in the next	section.

NAMED CAPTURE GROUPS

       Identifying capture groups by number is simple, but it can be very hard
       to keep track of	the numbers in complicated patterns.  Furthermore,  if
       an  expression  is  modified, the numbers may change. To	help with this
       difficulty, PCRE2 supports the naming of	capture	groups.	 This  feature
       was  not	 added to Perl until release 5.10. Python had the feature ear-
       lier, and PCRE1 introduced it at	release	4.0, using the Python  syntax.
       PCRE2 supports both the Perl and	the Python syntax.

       In  PCRE2,  a  capture  group  can  be  named  in  one  of  three ways:
       (?<name>...) or (?'name'...) as in Perl,	or (?P<name>...) as in Python.
       Names may be up to 128 code units long. When PCRE2_UTF is not set, they
       may contain only	ASCII alphanumeric  characters	and  underscores,  but
       must start with a non-digit. When PCRE2_UTF is set, the syntax of group
       names is	extended to allow any Unicode letter or	Unicode	decimal	digit.
       In other	words, group names must	match one of these patterns:

	 ^[_A-Za-z][_A-Za-z0-9]*\z   when PCRE2_UTF is not set
	 ^[_\p{L}][_\p{L}\p{Nd}]*\z  when PCRE2_UTF is set
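
       The non-UTF rule and the length limit can be combined into a simple
       validity check. This Python sketch is illustrative only:
       is_valid_group_name is a hypothetical helper, fullmatch() stands in
       for the ^ and \z anchors, and the length test assumes one code unit
       per character, which holds for the ASCII names this rule allows.

```python
import re

# Non-UTF rule from the text; fullmatch() replaces the ^...\z anchors.
NAME = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")
MAX_NAME_LENGTH = 128  # limit in code units, as stated above


def is_valid_group_name(name):
    """Check a group name against the non-UTF pattern and length limit."""
    return len(name) <= MAX_NAME_LENGTH and NAME.fullmatch(name) is not None
```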

       References  to  capture groups from other parts of the pattern, such as
       backreferences, recursion, and conditions, can all be made by  name  as
       well as by number.

       Named capture groups are	allocated numbers as well as names, exactly as
       if  the	names were not present.	In both	PCRE2 and Perl,	capture	groups
       are primarily identified	by numbers; any	names  are  just  aliases  for
       these numbers. The PCRE2	API provides function calls for	extracting the
       complete	 name-to-number	 translation table from	a compiled pattern, as
       well as convenience functions for  extracting  captured	substrings  by
       name.
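
       Python's re module treats names as aliases for numbers in the same
       way, which makes the idea easy to demonstrate. The sketch below uses
       Python's own API (groupindex and group), not the PCRE2 functions
       mentioned above, and the (?P<name>...) syntax that both engines
       accept.

```python
import re

pattern = re.compile(r"(?P<first>\w+) (?P<last>\w+)")
match = pattern.fullmatch("Ada Lovelace")

# Name-to-number table for the compiled pattern.
name_table = dict(pattern.groupindex)

# A captured substring can be fetched by name or by number.
by_name = match.group("last")
by_number = match.group(name_table["last"])
```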

       Warning:	 When  more than one capture group has the same	number,	as de-
       scribed in the previous section,	a name given to	one of them applies to
       all of them. Perl allows	identically numbered groups to have  different
       names.  Consider	this pattern, where there are two capture groups, both
       numbered	1:

	 (?|(?<AA>aa)|(?<BB>bb))

       Perl  allows  this,  with  both	names AA and BB	as aliases of group 1.
       Thus, after a successful	match, both names yield	the same value (either
       "aa" or "bb").

       In an attempt to	reduce confusion, PCRE2	does not allow the same	 group
       number to be associated with more than one name.	The example above pro-
       vokes  a	 compile-time  error. However, there is	still scope for	confu-
       sion. Consider this pattern:

	 (?|(?<AA>aa)|(bb))

       Although	the second group number	1 is not explicitly named, the name AA
       is still	an alias for any group 1. Whether the pattern matches "aa"  or
       "bb", a reference by name to group AA yields the	matched	string.

       By  default, a name must	be unique within a pattern, except that	dupli-
       cate names are permitted	for groups with	the same number, for example:

	 (?|(?<AA>aa)|(?<AA>bb))

       The duplicate name constraint can be disabled by setting the
       PCRE2_DUPNAMES option at compile time, or by the use of (?J) within
       the pattern, as described in the section entitled "Internal Option
       Setting" above.

       Duplicate names can be useful for patterns where	only one  instance  of
       the  named  capture group can match. Suppose you	want to	match the name
       of a weekday, either as a 3-letter abbreviation or as  the  full	 name,
       and  in	both  cases you	want to	extract	the abbreviation. This pattern
       (ignoring the line breaks) does the job:

	 (?J)
	 (?<DN>Mon|Fri|Sun)(?:day)?|
	 (?<DN>Tue)(?:sday)?|
	 (?<DN>Wed)(?:nesday)?|
	 (?<DN>Thu)(?:rsday)?|
	 (?<DN>Sat)(?:urday)?

       There are five capture groups, but only one is ever set after a	match.
       The convenience functions for extracting the data by name return the
       substring for the first (and in this example, the only) group  of  that
       name that matched. This saves searching to find which numbered group it
       was.  (An  alternative  way of solving this problem is to use a "branch
       reset" group, as	described in the previous section.)

       If you make a backreference to a	non-unique named group from  elsewhere
       in  the pattern,	the groups to which the	name refers are	checked	in the
       order in	which they appear in the overall pattern. The first  one  that
       is  set	is  used  for the reference. For example, this pattern matches
       both "foofoo" and "barbar" but not "foobar" or "barfoo":

	 (?J)(?:(?<n>foo)|(?<n>bar))\k<n>

       If you make a subroutine	call to	a non-unique named group, the one that
       corresponds to the first	occurrence of the name is used.	In the absence
       of duplicate numbers this is the	one with the lowest number.

       If you use a named reference in a condition test	(see the section about
       conditions below), either to check whether a capture group has matched,
       or to check for recursion, all groups with the same name	are tested. If
       the condition is	true for any one of them,  the	overall	 condition  is
       true.  This is the same behaviour as testing by number. For further de-
       tails of	the interfaces for handling  named  capture  groups,  see  the
       pcre2api	documentation.

REPETITION

       Repetition  is  specified  by  quantifiers, which may follow any	one of
       these items:

	 a literal data	character
	 the dot metacharacter
	 the \C	escape sequence
	 the \R	escape sequence
	 the \X	escape sequence
	 any escape sequence that matches a single character
	 a character class
	 a backreference
	 a parenthesized group (including lookaround assertions)
	 a subroutine call (recursive or otherwise)

       If a quantifier does not	follow a repeatable item, an error occurs. The
       general repetition quantifier specifies a minimum and maximum number of
       permitted matches by giving two numbers	in  curly  brackets  (braces),
       separated  by  a	 comma.	 The  numbers must be less than	65536, and the
       first must be less than or equal	to the second. For example,

	 z{2,4}

       matches "zz", "zzz", or "zzzz". A closing brace on its  own  is	not  a
       special	character.  If	the second number is omitted, but the comma is
       present,	there is no upper limit; if the	second number  and  the	 comma
       are  both omitted, the quantifier specifies an exact number of required
       matches.	Thus

	 [aeiou]{3,}

       matches at least	3 successive vowels, but may match many	more, whereas

	 \d{8}

       matches exactly 8 digits. If the	first number  is  omitted,  the	 lower
       limit is	taken as zero; in this case the	upper limit must be present.

	 X{,4} is interpreted as X{0,4}

       This  is	 a  change in behaviour	that happened in Perl 5.34.0 and PCRE2
       10.43. In earlier versions such a sequence was  not  interpreted	 as  a
       quantifier. Other regular expression engines may	behave either way.
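
       The bounds described above can be checked with a short experiment.
       This is only a sketch using Python's re module as a stand-in for
       PCRE2; Python happens to read {,4} as {0,4} as well, which, as noted,
       is not true of every engine. The helper name is hypothetical.

```python
import re

def whole(pattern, subject):
    """True if pattern matches the whole of subject."""
    return re.fullmatch(pattern, subject) is not None

# z{2,4} against "z", "zz", "zzz", "zzzz", "zzzzz"
bounded = [whole(r"z{2,4}", "z" * n) for n in range(1, 6)]

# No upper limit: ten vowels still satisfy {3,}
open_ended = whole(r"[aeiou]{3,}", "aeiouaeiou")

# Exact count: eight digits match, seven do not
exact = (whole(r"\d{8}", "12345678"), whole(r"\d{8}", "1234567"))

# Omitted minimum, read as X{0,4}: empty string matches, five X's do not
no_minimum = (whole(r"X{,4}", ""), whole(r"X{,4}", "XXXXX"))
```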

       If  the characters that follow an opening brace do not match the	syntax
       of a quantifier,	the brace is taken as a	literal	character. In particu-
       lar, this means that {,}	is a literal string of three characters.

       Note that not every opening brace is potentially	the start of a quanti-
       fier because braces are used  in	 other	items  such  as	 \N{U+345}  or
       \k{name}.

       In UTF modes, quantifiers apply to characters rather than to individual
       code  units. Thus, for example, \x{100}{2} matches two characters, each
       of which	is represented by a two-byte sequence in a UTF-8 string. Simi-
       larly, \X{3} matches three Unicode extended grapheme clusters, each  of
       which  may  be  several	code  units long (and they may be of different
       lengths).

       The quantifier {0} is permitted,	causing	the expression to behave as if
       the previous item and the quantifier were not present. This may be use-
       ful for capture groups that are referenced as  subroutines  from	 else-
       where  in the pattern (but see also the section entitled	"Defining cap-
       ture groups for use by reference	only" below). Except for parenthesized
       groups, items that have a {0} quantifier	are omitted from the  compiled
       pattern.

       For  convenience, the three most	common quantifiers have	single-charac-
       ter abbreviations:

	 *    is equivalent to {0,}
	 +    is equivalent to {1,}
	 ?    is equivalent to {0,1}

       It is possible to construct infinite loops by following	a  group  that
       can  match no characters	with a quantifier that has no upper limit, for
       example:

	 (a?)*

       Earlier versions	of Perl	and PCRE1 used to give	an  error  at  compile
       time for	such patterns. However,	because	there are cases	where this can
       be useful, such patterns	are now	accepted, but whenever an iteration of
       such  a group matches no	characters, matching moves on to the next item
       in the pattern instead of repeatedly matching  an  empty	 string.  This
       does  not  prevent  backtracking	into any of the	iterations if a	subse-
       quent item fails	to match.
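
       As an informal check that such a pattern is accepted and terminates,
       the sketch below runs it through Python's re module. Python is used
       only as a stand-in; its engine has its own safeguard against empty
       iterations, so this shows the visible effect rather than PCRE2's
       internal mechanism.

```python
import re

# (a?) can match the empty string, and * places no limit on repeats.
m = re.fullmatch(r"(a?)*b", "aab")
matched = m is not None        # the match completes instead of looping

# With no "a" available the group matches nothing and the match moves on.
empty_prefix = re.match(r"(a?)*", "xyz").group(0)
```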

       By default, quantifiers are "greedy", that is, they match  as  much  as
       possible	 (up  to the maximum number of permitted repetitions), without
       causing the rest	of the pattern to fail.	The classic example  of	 where
       this gives problems is in trying	to match comments in C programs. These
       appear  between	/*  and	 */ and	within the comment, individual * and /
       characters may appear. An attempt to match C comments by	 applying  the
       pattern

	 /\*.*\*/

       to the string

	 /* first comment */  not comment  /* second comment */

       fails,  because it matches the entire string owing to the greediness of
       the .*  item. However, if a quantifier is followed by a question	 mark,
       it ceases to be greedy, and instead matches the minimum number of times
       possible, so the	pattern

	 /\*.*?\*/

       does  the right thing with C comments. The meaning of the various quan-
       tifiers is not otherwise	changed, just the preferred number of matches.
       Do not confuse this use of question mark	with its use as	 a  quantifier
       in  its	own  right.   Because it has two uses, it can sometimes	appear
       doubled,	as in

	 \d??\d

       which matches one digit by preference, but can match two	if that	is the
       only way	the rest of the	pattern	matches.
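
       The difference between the greedy and minimizing forms can be seen by
       running both comment patterns over the sample string. The sketch uses
       Python's re module as a stand-in for PCRE2; the two engines agree on
       these particular constructs, and the variable names are illustrative
       only.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

greedy = re.findall(r"/\*.*\*/", subject)     # .* runs to the last */
lazy = re.findall(r"/\*.*?\*/", subject)      # .*? stops at the first */

# A lazy ? quantifier followed by \d: one digit by preference, two if
# the rest of the pattern (here, reaching the end) requires it.
preferred = re.match(r"\d??\d", "12").group(0)
forced = re.fullmatch(r"\d??\d", "12").group(0)
```

       The greedy pattern yields a single match covering the whole string,
       whereas the minimizing pattern yields the two comments separately.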

       If the PCRE2_UNGREEDY option is set (an option that is not available in
       Perl), the quantifiers are not greedy by	default, but  individual  ones
       can  be	made  greedy  by following them	with a question	mark. In other
       words, it inverts the default behaviour.

       When a parenthesized group is quantified	with a	minimum	 repeat	 count
       that  is	 greater  than 1 or with a limited maximum, more memory	is re-
       quired for the compiled pattern,	in proportion to the size of the mini-
       mum or maximum.

       If a pattern starts with	 .*  or	 .{0,}	and  the  PCRE2_DOTALL	option
       (equivalent  to	Perl's /s) is set, thus	allowing the dot to match new-
       lines, the pattern is implicitly	 anchored,  because  whatever  follows
       will  be	 tried against every character position	in the subject string,
       so there	is no point in retrying	the overall match at any position  af-
       ter  the	 first.	PCRE2 normally treats such a pattern as	though it were
       preceded	by \A.

       In cases	where it is known that the subject  string  contains  no  new-
       lines,  it  is worth setting PCRE2_DOTALL in order to obtain this opti-
       mization, or alternatively, using ^ to indicate anchoring explicitly.

       However,	there are some cases where the optimization  cannot  be	 used.
       When  .*	  is  inside  capturing	 parentheses that are the subject of a
       backreference elsewhere in the pattern, a match at the start  may  fail
       where a later one succeeds. Consider, for example:

	 (.*)abc\1

       If  the subject is "xyz123abc123" the match point is the	fourth charac-
       ter. For	this reason, such a pattern is not implicitly anchored.
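
       The match point in this example can be confirmed with a brief sketch.
       Python's re module is used here as a stand-in for PCRE2; it reports
       zero-based offsets, so the fourth character is offset 3.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

match_offset = m.start()      # zero-based offset of the match point
captured = m.group(1)         # text that \1 had to repeat
```

       An attempt anchored at the start cannot succeed, because no choice of
       group 1 beginning at offset 0 is repeated after "abc".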

       Another case where implicit anchoring is	not applied is when the	 lead-
       ing  .* is inside an atomic group. Once again, a	match at the start may
       fail where a later one succeeds.	Consider this pattern:

	 (?>.*?a)b

       It matches "ab" in the subject "aab". The use of the backtracking
       control verbs (*PRUNE) and (*SKIP) also disables this optimization.
       To disable it explicitly, either pass the compile option
       PCRE2_NO_DOTSTAR_ANCHOR, or call pcre2_set_optimize() with a
       PCRE2_DOTSTAR_ANCHOR_OFF directive.

       When  a	capture	group is repeated, the value captured is the substring
       that matched the	final iteration. For example, after

	 (tweedle[dume]{3}\s*)+

       has matched "tweedledum tweedledee" the value of	the captured substring
       is "tweedledee".	However, if there are nested capture groups, the  cor-
       responding  captured  values  may have been set in previous iterations.
       For example, after

	 (a|(b))+

       matches "aba" the value of the second captured substring	is "b".
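
       Both examples can be reproduced with the following sketch, which uses
       Python's re module as a stand-in for PCRE2. Python behaves the same
       way for these two patterns, though other engines may treat nested
       groups differently.

```python
import re

m1 = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m1.group(1)          # only the final repeat is kept

m2 = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = m2.group(1), m2.group(2)
# outer comes from the third iteration, inner from the second
```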

ATOMIC GROUPING	AND POSSESSIVE QUANTIFIERS

       With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
       repetition,  failure  of	what follows normally causes the repeated item
       to be re-evaluated to see if a different	number of repeats  allows  the
       rest  of	 the pattern to	match. Sometimes it is useful to prevent this,
       either to change the nature of the match, or to cause it to fail
       earlier than it otherwise might, when the author of the pattern knows
       there is no point in carrying on.

       Consider, for example, the pattern \d+foo when applied to  the  subject
       line

	 123456bar

       After matching all 6 digits and then failing to match "foo", the	normal
       action  of  the matcher is to try again with only 5 digits matching the
       \d+ item, and then with	4,  and	 so  on,  before  ultimately  failing.
       "Atomic	grouping"  (a  term taken from Jeffrey Friedl's	book) provides
       the means for specifying	that once a group has matched, it is not to be
       re-evaluated in this way.

       If we use atomic	grouping for the previous example, the	matcher	 gives
       up  immediately	on failing to match "foo" the first time. The notation
       is a kind of special parenthesis, starting with (?> as in this example:

	 (?>\d+)foo

       Perl 5.28 introduced an experimental alphabetic form starting  with  (*
       which may be easier to remember:

	 (*atomic:\d+)foo

       This  kind of parenthesized group "locks	up" the	part of	the pattern it
       contains	once it	has matched, and a failure further into	the pattern is
       prevented from backtracking into	it. Backtracking past it  to  previous
       items, however, works as	normal.

       An alternative description is that a group of this type matches exactly
       the  string  of	characters  that an identical standalone pattern would
       match, if anchored at the current point in the subject string.

       Atomic groups are not capture groups. Simple cases such	as  the	 above
       example	can  be	 thought  of  as a maximizing repeat that must swallow
       everything it can.  So, while both \d+ and \d+? are prepared to	adjust
       the  number  of digits they match in order to make the rest of the pat-
       tern match, (?>\d+) can only match an entire sequence of	digits.

       Atomic groups in	general	can of course contain arbitrarily  complicated
       expressions, and	can be nested. However,	when the contents of an	atomic
       group are just a single repeated item, as in the example above, a
       simpler notation, called a "possessive quantifier", can be used. This
       consists of an additional + character following a quantifier. Using
       this notation, the previous example can be rewritten as

	 \d++foo

       Note that a possessive quantifier can be	used with an entire group, for
       example:

	 (abc|xyz){2,3}+

       Possessive quantifiers are always greedy; the setting of the
       PCRE2_UNGREEDY option is ignored. They are a convenient notation
       for the simpler forms of atomic group. However, there is no
       difference in the
       meaning	of  a  possessive  quantifier and the equivalent atomic	group,
       though there may	be a performance  difference;  possessive  quantifiers
       should be slightly faster.
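
       The change in meaning that atomic grouping brings is easiest to see
       with a pattern whose final item needs a character that the repeat has
       already taken. The sketch below uses Python's re module as a stand-in
       for PCRE2 and assumes Python 3.11 or later, the first version to
       accept (?>...) and possessive quantifiers; on older versions the
       atomic results are left as None. The helper name is hypothetical.

```python
import re
import sys

# Atomic groups and possessive quantifiers need Python 3.11 or later.
HAS_ATOMIC = sys.version_info >= (3, 11)

def finds(pattern, subject):
    """Return True if pattern matches somewhere in subject."""
    return re.search(pattern, subject) is not None

backtracking = finds(r"\d+5", "12345")       # \d+ gives back the final 5

if HAS_ATOMIC:
    atomic = finds(r"(?>\d+)5", "12345")     # group keeps all five digits
    possessive = finds(r"\d++5", "12345")    # same meaning, shorter form
else:
    atomic = possessive = None               # syntax not available
```

       The ordinary quantifier matches because it can release the last
       digit, whereas the atomic and possessive forms cannot, so they fail.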

       The  possessive	quantifier syntax is an	extension to the Perl 5.8 syn-
       tax.  Jeffrey Friedl originated the idea	(and the name)	in  the	 first
       edition of his book. Mike McCloskey liked it, so	implemented it when he
       built  Sun's Java package, and PCRE1 copied it from there. It found its
       way into	Perl at	release	5.10.

       PCRE2 has an optimization  that	automatically  "possessifies"  certain
       simple  pattern constructs. For example,	the sequence A+B is treated as
       A++B because there is no	point in backtracking into a sequence  of  A's
       when   B	  must	 follow.    This   feature  can	 be  disabled  by  the
       PCRE2_NO_AUTO_POSSESS option, by	calling	 pcre2_set_optimize()  with  a
       PCRE2_AUTO_POSSESS_OFF  directive,  or  by  starting  the  pattern with
       (*NO_AUTO_POSSESS).

       When a pattern contains an unlimited repeat inside a group that can it-
       self be repeated	an unlimited number of times, the  use	of  an	atomic
       group  is the only way to avoid some failing matches taking a very long
       time indeed. The	pattern

	 (\D+|<\d+>)*[!?]

       matches an unlimited number of substrings that either consist  of  non-
       digits,	or  digits  enclosed in	<>, followed by	either ! or ?. When it
       matches,	it runs	quickly. However, if it	is applied to

	 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

       it takes	a long time before reporting  failure.	This  is  because  the
       string  can be divided between the internal \D+ repeat and the external
       * repeat	in a large number of ways, and all have	to be tried. (The  ex-
       ample uses [!?] rather than a single character at the end, because both
       PCRE2 and Perl have an optimization that	allows for fast	failure	when a
       single  character is used. They remember	the last single	character that
       is required for a match,	and fail early if it is	 not  present  in  the
       string.)	 If  the  pattern  is changed so that it uses an atomic	group,
       like this:

	 ((?>\D+)|<\d+>)*[!?]

       sequences of non-digits cannot be broken, and failure happens quickly.
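
       The quick failure of the atomic version can be tried out as follows.
       This is a sketch using Python's re module as a stand-in for PCRE2,
       assuming Python 3.11 or later for the (?>...) syntax. The unprotected
       pattern is deliberately not run against the long string, since that
       is the slow case being described.

```python
import re
import sys

# The (?>...) syntax is available in Python's re from version 3.11.
HAS_ATOMIC = sys.version_info >= (3, 11)

subject = "a" * 52            # no ! or ? anywhere, so every attempt fails

if HAS_ATOMIC:
    # Each run of non-digits is taken whole, so there is only one way to
    # divide the string and the failure is reported quickly.
    result = re.search(r"((?>\D+)|<\d+>)*[!?]", subject)
else:
    result = None             # syntax unavailable; nothing to demonstrate

# When a terminator is present the original pattern succeeds quickly.
success = re.search(r"(\D+|<\d+>)*[!?]", "abc<123>def!") is not None
```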

BACKREFERENCES

       Outside a character class, a backslash followed by a digit greater than
       0 (and possibly further digits) is a backreference to a	capture	 group
       earlier (that is, to its	left) in the pattern, provided there have been
       that many previous capture groups.

       However,	 if the	decimal	number following the backslash is less than 8,
       it is always taken as a backreference, and  causes  an  error  only  if
       there  are not that many	capture	groups in the entire pattern. In other
       words, the group	that is	referenced need	not be to the left of the ref-
       erence for numbers less than 8. A "forward backreference" of this  type
       can make	sense when a repetition	is involved and	the group to the right
       has participated	in an earlier iteration.

       It  is  not  possible  to have a	numerical "forward backreference" to a
       group whose number is 8 or more using this syntax  because  a  sequence
       such  as	 \50  is  interpreted as a character defined in	octal. See the
       subsection entitled "Non-printing characters" above for further details
       of the handling of digits following a backslash.	Other forms  of	 back-
       referencing  do	not suffer from	this restriction. In particular, there
       is no problem when named	capture	groups are used	(see below).

       Another way of avoiding the ambiguity inherent in  the  use  of	digits
       following  a  backslash	is  to use the \g escape sequence. This	escape
       must be followed	by a signed or unsigned	number,	optionally enclosed in
       braces. These examples are all identical:

	 (ring), \1
	 (ring), \g1
	 (ring), \g{1}

       An unsigned number specifies an absolute	reference without the  ambigu-
       ity that	is present in the older	syntax.	It is also useful when literal
       digits  follow  the reference. A	signed number is a relative reference.
       Consider	this example:

	 (abc(def)ghi)\g{-1}

       The sequence \g{-1} is a	reference to the capture group whose number is
       one less	than the number	of the next group to be	started,  so  in  this
       example (where the next group would be numbered 3) it is equivalent to
       \2, and \g{-2} would be equivalent to \1. Note that if  this  construct
       is  inside  a capture group, that group is included in the count, so in
       this example \g{-2} also	refers to group	1:

	 (A)(\g{-2}B)

       The use of relative references can be helpful  in  long	patterns,  and
       also  in	 patterns  that	are created by joining together	fragments that
       contain references within themselves.

       The sequence \g{+1} is a	reference to the next capture  group  that  is
       started	after  this item, and \g{+2} refers to the one after that, and
       so on. This kind	of forward reference can be useful  in	patterns  that
       repeat. Perl does not support the use of	+ in this way.

       A  backreference	 matches  whatever  actually most recently matched the
       capture group in	the current subject string, rather  than  anything  at
       all that	matches	the group (see "Groups as subroutines" below for a way
       of doing	that). So the pattern

	 (sens|respons)e and \1ibility

       matches	"sense and sensibility"	and "response and responsibility", but
       not "sense and responsibility". If caseful matching is in force at  the
       time  of	 the backreference, the	case of	letters	is relevant. For exam-
       ple,

	 ((?i)rah)\s+\1

       matches "rah rah" and "RAH RAH",	but not	"RAH  rah",  even  though  the
       original	capture	group is matched caselessly.
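
       The first of these examples can be checked with a short sketch.
       Python's re module is used as a stand-in for PCRE2; for a simple
       numbered backreference such as \1 the two behave alike.

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

titles = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",     # \1 would have to differ from group 1
]
results = [pattern.fullmatch(t) is not None for t in titles]
```

       The third title fails because \1 must repeat the text that group 1
       actually matched, not merely something the group could match.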

       There  are  several  different  ways of writing backreferences to named
       capture groups. The .NET	syntax	is  \k{name},  the  Python  syntax  is
       (?P=name), and the original Perl syntax is \k<name> or \k'name'. All of
       these are now supported by both Perl and	 PCRE2.	 Perl  5.10's  unified
       backreference  syntax,  in  which  \g  can be used for both numeric and
       named references, is also supported by PCRE2.   We  could  rewrite  the
       above example in	any of the following ways:

	 (?<p1>(?i)rah)\s+\k<p1>
	 (?'p1'(?i)rah)\s+\k{p1}
	 (?P<p1>(?i)rah)\s+(?P=p1)
	 (?<p1>(?i)rah)\s+\g{p1}

       A  capture  group  that is referenced by	name may appear	in the pattern
       before or after the reference.

       There may be more than one backreference	to the same group. If a	 group
       has  not	actually been used in a	particular match, backreferences to it
       always fail by default. For example, the	pattern

	 (a|(bc))\2

       always fails if it starts to match "a" rather than  "bc".  However,  if
       the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref-
       erence to an unset value	matches	an empty string.
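
       The default behaviour can be illustrated with the sketch below, which
       uses Python's re module as a stand-in for PCRE2. Python has no
       counterpart to PCRE2_MATCH_UNSET_BACKREF, so only the default case,
       where a reference to an unset group fails, is shown.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

unset = pattern.fullmatch("a")        # group 2 never took part
set_ = pattern.fullmatch("bcbc")      # group 2 is "bc", repeated by \2
```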

       Because	there may be many capture groups in a pattern, all digits fol-
       lowing a	backslash are taken as part of a potential backreference  num-
       ber.  If	 the  pattern continues	with a digit character,	some delimiter
       must be used to terminate the backreference. If the  PCRE2_EXTENDED  or
       PCRE2_EXTENDED_MORE  option is set, this	can be white space. Otherwise,
       the \g{}	syntax or an empty comment (see	"Comments" below) can be used.

   Recursive backreferences

       A backreference that occurs inside the group to which it	 refers	 fails
       when  the  group	 is  first used, so, for example, (a\1)	never matches.
       However,	such references	can be useful inside repeated groups. For  ex-
       ample, the pattern

	 (a|b\1)+

       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
       ation of	the group, the backreference matches the character string cor-
       responding  to  the  previous iteration.	In order for this to work, the
       pattern must be such that the first iteration does not  need  to	 match
       the  backreference. This	can be done using alternation, as in the exam-
       ple above, or by	a quantifier with a minimum of zero.

       For versions of PCRE2 less than 10.25, backreferences of	this type used
       to cause	the group that they reference  to  be  treated	as  an	atomic
       group.	This restriction no longer applies, and	backtracking into such
       groups can occur	as normal.

ASSERTIONS

       An assertion is a test that does	not consume any	characters.  The  test
       must  succeed for the match to continue.	The simple assertions coded as
       \b, \B, \A, \G, \Z, \z, ^ and $ are described above.

       More complicated	assertions  are	 coded	as  parenthesized  groups.  If
       matching	 such  a group succeeds, matching continues after it, but with
       the matching position in	the subject string reset to what it was	before
       the assertion was processed.

       A special kind of  assertion,  called  a	 "scan	substring"  assertion,
       matches	a  subpattern against a	previously captured substring. This is
       described in the	section	entitled "Scan substring assertions" below. It
       is a PCRE2 extension, not compatible with Perl.

       The other group-based assertions are of two kinds: those that look ahead
       of the current position in the subject string, and those	that look  be-
       hind  it, and in	each case an assertion may be positive (must match for
       the assertion to	be true) or negative (must not match for the assertion
       to be true).

       The Perl-compatible lookaround assertions are atomic. If	 an  assertion
       is  true, but there is a	subsequent matching failure, there is no back-
       tracking	into the assertion. However, there are some cases  where  non-
       atomic  assertions can be useful. PCRE2 has some	support	for these, de-
       scribed in the section entitled "Non-atomic assertions" below, but they
       are not Perl-compatible.

       A lookaround assertion may appear as the	 condition  in	a  conditional
       group  (see  below). In this case, the result of	matching the assertion
       determines which	branch of the condition	is followed.

       Assertion groups	are not	capture	groups.	If an assertion	contains  cap-
       ture  groups within it, these are counted for the purposes of numbering
       the capture groups in the whole pattern.	Within each branch of  an  as-
       sertion,	 locally  captured  substrings	may be referenced in the usual
       way. For	example, a sequence such as (.)\g{-1} can  be  used  to	 check
       that two	adjacent characters are	the same.

       When  a	branch within an assertion fails to match, any substrings that
       were captured are discarded (as happens with any	 pattern  branch  that
       fails  to  match).  A  negative	assertion  is  true  only when all its
       branches	fail to	match; this means that no captured substrings are ever
       retained	after a	successful negative assertion. When an assertion  con-
       tains a matching	branch,	what happens depends on	the type of assertion.

       For  a  positive	 assertion, internally captured	substrings in the suc-
       cessful branch are retained, and	matching continues with	the next  pat-
       tern  item  after  the  assertion. For a	negative assertion, a matching
       branch means that the assertion is not true. If such  an	 assertion  is
       being  used as a	condition in a conditional group (see below), captured
       substrings are retained,	 because  matching  continues  with  the  "no"
       branch of the condition.	For other failing negative assertions, control
       passes to the previous backtracking point, thus discarding any captured
       strings within the assertion.

       Most  assertion groups may be repeated; though it makes no sense	to as-
       sert the	same thing several times, the side effect of capturing in pos-
       itive assertions	may occasionally be useful. However, an	assertion that
       forms the condition for a conditional  group  may  not  be  quantified.
       PCRE2  used  to restrict	the repetition of assertions, but from release
       10.35 the only restriction is that an unlimited maximum	repetition  is
       changed	to  be one more	than the minimum. For example, {3,} is treated
       as {3,4}.

   Alphabetic assertion	names

       Traditionally, symbolic sequences such as (?= and (?<= have  been  used
       to  specify lookaround assertions. Perl 5.28 introduced some experimen-
       tal alphabetic alternatives which might be easier to remember. They all
       start with (* instead of	(? and must be written using lower  case  let-
       ters. PCRE2 supports the	following synonyms:

	 (*positive_lookahead:	or (*pla: is the same as (?=
	 (*negative_lookahead:	or (*nla: is the same as (?!
	 (*positive_lookbehind:	or (*plb: is the same as (?<=
	 (*negative_lookbehind:	or (*nlb: is the same as (?<!

       For  example,  (*pla:foo) is the	same assertion as (?=foo). In the fol-
       lowing sections,	the various assertions are described using the	origi-
       nal symbolic forms.

   Lookahead assertions

       Lookahead assertions start with (?= for positive	assertions and (?! for
       negative	assertions. For	example,

	 \w+(?=;)

       matches	a word followed	by a semicolon,	but does not include the semi-
       colon in	the match, and

	 foo(?!bar)

       matches any occurrence of "foo" that is not  followed  by  "bar".  Note
       that the	apparently similar pattern

	 (?!foo)bar

       does  not  find	an  occurrence	of "bar" that is preceded by something
       other than "foo"; it finds any occurrence of "bar" whatsoever,  because
       the assertion (?!foo) is	always true when the next three	characters are
       "bar". A	lookbehind assertion is	needed to achieve the other effect.

       If you want to force a matching failure at some point in	a pattern, the
       most  convenient	 way to	do it is with (?!) because an empty string al-
       ways matches, so	an assertion that requires there not to	 be  an	 empty
       string must always fail.	 The backtracking control verb (*FAIL) or (*F)
       is a synonym for	(?!).
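
       The three lookahead patterns discussed in this subsection can be
       tried out as follows. The sketch uses Python's re module as a
       stand-in for PCRE2; the subject strings are invented for the example,
       and offsets are zero-based.

```python
import re

# \w+(?=;): the semicolon is required but not included in the match.
word = re.search(r"\w+(?=;)", "return value;").group(0)

# foo(?!bar): skips the "foo" in "foobar", finds the one in "foobaz".
offset = re.search(r"foo(?!bar)", "foobar foobaz").start()

# (?!foo)bar: the assertion looks at "bar" itself, so it is always true
# at the point where "bar" begins.
still_found = re.search(r"(?!foo)bar", "foobar").start()
```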

   Lookbehind assertions

       Lookbehind  assertions start with (?<= for positive assertions and (?<!
       for negative assertions.	For example,

	 (?<!foo)bar

       does find an occurrence of "bar"	that is	not  preceded  by  "foo".  The
       contents	 of a lookbehind assertion are restricted such that there must
       be a known maximum to the lengths of all	the strings it matches.	 There
       are two cases:

       If every	top-level alternative matches a	fixed length, for example

	 (?<=colour|color)

       there  is a limit of 65535 characters to	the lengths, which do not have
       to be the same, as this example demonstrates. This is the only kind  of
       lookbehind  supported  by  PCRE2	versions earlier than 10.43 and	by the
       alternative matching function pcre2_dfa_match().

       In PCRE2	10.43 and later, pcre2_match() supports	lookbehind  assertions
       in  which  one  or  more	top-level alternatives can match more than one
       string length, for example

	 (?<=colou?r)

       The maximum matching length for any branch of the lookbehind is limited
       to a value set by the calling program (default 255 characters).	Unlim-
       ited  repetition	(for example \d*) is not supported. In some cases, the
       escape sequence \K (see above) can be used instead of a lookbehind  as-
       sertion	at  the	 start	of a pattern to	get round the length limit re-
       striction.

       In UTF-8	and UTF-16 modes, PCRE2	does not allow the  \C	escape	(which
       matches	a single code unit even	in a UTF mode) to appear in lookbehind
       assertions, because it makes it impossible to calculate the  length  of
       the  lookbehind.	 The \X	and \R escapes,	which can match	different num-
       bers of code units, are never permitted in lookbehinds.

       "Subroutine" calls (see below) such as (?2) or (?&X) are	 permitted  in
       lookbehinds,  as	 long  as  the called capture group matches a limited-
       length string. However, recursion, that is, a "subroutine" call into  a
       group that is already active, is	not supported.

       PCRE2  supports backreferences in lookbehinds, but only if certain con-
       ditions are met.	The PCRE2_MATCH_UNSET_BACKREF option must not be  set,
       there  must be no use of	(?| in the pattern (it creates duplicate group
       numbers), and if	the backreference is by	name, the name must be unique.
       Of course, the referenced group must itself match a limited length sub-
       string. The following pattern matches words  containing	at  least  two
       characters that begin and end with the same character:

	  \b(\w)\w++(?<=\1)

       Possessive  quantifiers	can be used in conjunction with	lookbehind as-
       sertions	to specify efficient matching at the end of  subject  strings.
       Consider	a simple pattern such as

	 abcd$

       when  applied  to  a  long string that does not match. Because matching
       proceeds	from left to right, PCRE2 will look for	each "a" in  the  sub-
       ject  and  then see if what follows matches the rest of the pattern. If
       the pattern is specified	as

	 ^.*abcd$

       the initial .* matches the entire string	at first, but when this	 fails
       (because	there is no following "a"), it backtracks to match all but the
       last  character,	 then all but the last two characters, and so on. Once
       again the search	for "a"	covers the entire string, from right to	 left,
       so we are no better off.	However, if the	pattern	is written as

	 ^.*+(?<=abcd)

       there can be no backtracking for	the .*+	item because of	the possessive
       quantifier; it can match	only the entire	string.	The subsequent lookbe-
       hind  assertion	does  a	single test on the last	four characters. If it
       fails, the match	fails immediately. For	long  strings,	this  approach
       makes a significant difference to the processing	time.

   Using multiple assertions

       Several assertions (of any sort)	may occur in succession. For example,

	 (?<=\d{3})(?<!999)foo

       matches	"foo" preceded by three	digits that are	not "999". Notice that
       each of the assertions is applied independently at the  same  point  in
       the  subject  string.  First  there  is a check that the	previous three
       characters are all digits, and then there is  a	check  that  the  same
       three characters are not "999". This pattern does not match "foo"
       preceded by six characters, the first three of which are digits and
       the last three of which are not "999". For example, it does not match
       "123abcfoo". A pattern to do that is

	 (?<=\d{3}...)(?<!999)foo

       This  time  the	first assertion	looks at the preceding six characters,
       checking	that the first three are digits, and then the second assertion
       checks that the preceding three characters are not "999".
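
       The difference between the two patterns can be checked with a short
       script. This is only a sketch: it assumes Python's re module, which
       is a different engine from PCRE2 but accepts these fixed-length
       lookbehinds with the same meaning. The names same_point, six_back,
       and matches are illustrative.

```python
import re

# Both lookbehinds are tested at the same position, just before "foo".
same_point = re.compile(r"(?<=\d{3})(?<!999)foo")

# Here the first lookbehind spans six characters, the second only three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def matches(pattern, subject):
    """Return True if the compiled pattern matches anywhere in subject."""
    return pattern.search(subject) is not None
```

       With these definitions, matches(same_point, "123abcfoo") is false
       while matches(six_back, "123abcfoo") is true.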

       Assertions can be nested	in any combination. For	example,

	 (?<=(?<!foo)bar)baz

       matches an occurrence of	"baz" that is preceded by "bar"	which in  turn
       is not preceded by "foo", while

	 (?<=\d{3}(?!999)...)foo

       is  another pattern that	matches	"foo" preceded by three	digits and any
       three characters	that are not "999".
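
       Both nested forms can be tried out in the same way. Again this is a
       sketch that assumes Python's re module rather than PCRE2; it accepts
       these particular patterns because every lookbehind in them has a
       fixed length. The variable names are illustrative.

```python
import re

# "baz" preceded by "bar", where that "bar" is not itself preceded by "foo".
bar_not_after_foo = re.compile(r"(?<=(?<!foo)bar)baz")

# "foo" preceded by three digits and then three characters other than "999".
digits_then_not_999 = re.compile(r"(?<=\d{3}(?!999)...)foo")
```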

NON-ATOMIC ASSERTIONS

       Traditional lookaround assertions are atomic. That is, if an  assertion
       is  true, but there is a	subsequent matching failure, there is no back-
       tracking	into the assertion. However, there are some cases  where  non-
       atomic  positive	 assertions  can be useful. PCRE2 provides these using
       the following syntax:

	 (*non_atomic_positive_lookahead:  or (*napla: or (?*
	 (*non_atomic_positive_lookbehind: or (*naplb: or (?<*

       Consider	the problem of finding the right-most word in  a  string  that
       also  appears  earlier  in the string, that is, it must appear at least
       twice in	total.	This pattern returns the required result  as  captured
       substring 1:

	 ^(?x)(*napla: .* \b(\w++)) (?>	.*? \b\1\b ){2}

       For  a subject such as "word1 word2 word3 word2 word3 word4" the	result
       is "word3". How does it work? At	the start, ^(?x) anchors  the  pattern
       and sets	the "x"	option,	which causes white space (introduced for read-
       ability)	 to  be	 ignored. Inside the assertion,	the greedy .* at first
       consumes	the entire string, but then has	to backtrack until the rest of
       the assertion can match a word, which is	captured by group 1. In	 other
       words,  when  the  assertion first succeeds, it captures	the right-most
       word in the string.

       The current matching point is then reset	to the start of	 the  subject,
       and  the	 rest  of  the pattern match checks for	two occurrences	of the
       captured	word, using an ungreedy	.*? to scan from  the  left.  If  this
       succeeds,  we are done, but if the last word in the string does not oc-
       cur twice, this part of the pattern  fails.  If	a  traditional	atomic
       lookahead  (?=  or (*pla: had been used,	the assertion could not	be re-
       entered,	and the	whole match would fail.	The pattern would succeed only
       if the very last	word in	the subject was	found twice.

       Using a non-atomic lookahead, however, means that when  the  last  word
       does  not  occur	 twice	in the string, the lookahead can backtrack and
       find the	second-last word, and so on, until either the match  succeeds,
       or all words have been tested.
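
       The search order that backtracking into the assertion produces can be
       imitated procedurally. The following is a hand-written emulation, not
       PCRE2 code: it assumes Python, treats a word as a maximal run of \w
       characters, and uses the illustrative name rightmost_repeated_word.

```python
import re


def rightmost_repeated_word(subject):
    """Return the right-most word that occurs at least twice, else None."""
    words = re.findall(r"\w+", subject)
    # Try candidates from right to left, which is the order in which
    # backtracking into the assertion would offer them to group 1.
    for candidate in reversed(words):
        # The rest of the pattern needs two whole-word occurrences.
        if words.count(candidate) >= 2:
            return candidate
    return None
```

       For the subject used above, the candidates tried are "word4" (seen
       once, rejected) and then "word3" (seen twice, accepted).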

       Two conditions must be met for a	non-atomic assertion to	be useful: the
       contents	 of one	or more	capturing groups must change after a backtrack
       into the	assertion, and there must be  a	 backreference	to  a  changed
       group  later  in	 the pattern. If this is not the case, the rest	of the
       pattern match fails exactly as before because nothing has  changed,  so
       using a non-atomic assertion just wastes	resources.

       There  is one exception to backtracking into a non-atomic assertion. If
       an (*ACCEPT) control verb is triggered, the assertion  succeeds	atomi-
       cally.  That  is,  a subsequent match failure cannot backtrack into the
       assertion.

       Non-atomic assertions are not supported	by  the	 alternative  matching
       function	pcre2_dfa_match(). They	are supported by JIT, but only if they
       do not contain any control verbs such as (*ACCEPT). (This may change
       in the future.) Note that assertions that appear as conditions for
       conditional groups (see below) must be atomic.

SCAN SUBSTRING ASSERTIONS

       A  special kind of assertion, not compatible with Perl, makes it	possi-
       ble to check the	contents of a captured substring by matching it	with a
       subpattern.  Because this involves capturing, this feature is not  sup-
       ported by pcre2_dfa_match().

       A  scan	substring assertion starts with	the sequence (*scan_substring:
       or (*scs: which is followed by a	list of	substring numbers (absolute or
       relative) and/or	substring names	enclosed in  single  quotes  or	 angle
       brackets,  all  within parentheses. The rest of the item	is the subpat-
       tern that is applied to the substring, as shown in these	examples:

	 (*scan_substring:(1)...)
	 (*scs:(-2)...)
	 (*scs:('AB')...)
	 (*scs:(1,'AB',-2)...)

       The list	of groups is checked in	the order they are given,  and	it  is
       the contents of the first one that is found to be set that are scanned.
       When  PCRE2_DUPNAMES  is	 set  and there	are ambiguous group names, all
       groups with the same name are checked in	numerical order. A  scan  sub-
       string  assertion  fails	 if none of the	groups it references have been
       set.

       The pattern match on the	substring is always anchored, that is, it must
       match from the start of the substring. There is no  "bumpalong"	if  it
       does  not match at the start. The end of	the subject is temporarily re-
       set to be the end of the	substring, so \Z, \z, and $ will match	there.
       However,	 the  start  of	 the  subject  is not reset. This means	that ^
       matches only if the substring is	actually at the	start of the main sub-
       ject, but it also means that lookbehind assertions into	what  precedes
       the substring are possible.

       Here  is	 a very	simple example:	find a word that contains the rare (in
       English)	sequence of letters "rh" not at	the start:

	 \b(\w++)(*scs:(1).+rh)

       The first group captures	a word which is	then  scanned  by  the	second
       group.	This  example does not actually	need this heavyweight feature;
       the same	match can be achieved with:

	 \b\w+?rh\w*\b
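
       The simpler pattern uses no PCRE2-specific syntax, so its behaviour
       can be explored in other engines. This sketch assumes Python's re
       module; RH_INSIDE and words_with_inner_rh are illustrative names.

```python
import re

# One or more word characters (as few as possible), then "rh", then the
# rest of the word. The leading \w+? keeps "rh" away from the word start.
RH_INSIDE = re.compile(r"\b\w+?rh\w*\b")


def words_with_inner_rh(text):
    """Return the words in text that contain "rh" other than at the start."""
    return RH_INSIDE.findall(text)
```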

       When things are more complicated, however, scanning a captured
       substring can be a useful way to describe the required match. For example,
       there is	a rather complicated pattern  in  the  PCRE2  test  data  that
       checks an entire	subject	string for a palindrome, that is, the sequence
       of  letters  is the same	in both	directions. Suppose you	want to	search
       for individual words of two or more characters such as "level" that are
       palindromes:

	 (\b\w{2,}+\b)(*scs:(1)...palindrome-matching-pattern...)

       Within a	substring scanning subpattern, references to other groups work
       as normal. Capturing groups may appear, and will	 retain	 their	values
       during ongoing matching if the assertion	succeeds.

SCRIPT RUNS

       In  concept, a script run is a sequence of characters that are all from
       the same	Unicode	script such as Latin or	Greek. However,	 because  some
       scripts	are  commonly  used together, and because some diacritical and
       other marks are used with multiple scripts,  it	is  not	 that  simple.
       There is	a full description of the rules	that PCRE2 uses	in the section
       entitled	"Script	Runs" in the pcre2unicode documentation.

       If  part	 of a pattern is enclosed between (*script_run:	or (*sr: and a
       closing parenthesis, it fails if the sequence of characters that it
       matches is not a script run. After a failure, normal backtracking
       occurs. Script runs can be used to detect spoofing attacks using
       characters that look the same, but are from different scripts. The string
       "paypal.com" is an infamous example, where the letters could be a  mix-
       ture of Latin and Cyrillic. This	pattern	ensures	that the matched char-
       acters in a sequence of non-spaces that follow white space are a	script
       run:

	 \s+(*sr:\S+)

       To  be  sure  that  they	are all	from the Latin script (for example), a
       lookahead can be	used:

	 \s+(?=\p{Latin})(*sr:\S+)

       This works as long as the first character is expected to	be a character
       in that script, and not (for example)  punctuation,  which  is  allowed
       with  any script. If this is not	the case, a more creative lookahead is
       needed. For example, if digits, underscore, and dots are	 permitted  at
       the start:

	 \s+(?=[0-9_.]*\p{Latin})(*sr:\S+)

       In  many	 cases,	backtracking into a script run pattern fragment	is not
       desirable. The script run can employ an atomic group to	prevent	 this.
       Because	this is	a common requirement, a	shorthand notation is provided
       by (*atomic_script_run: or (*asr:

	 (*asr:...) is the same	as (*sr:(?>...))

       Note that the atomic group is inside the	script run. Putting it outside
       would not prevent backtracking into the script run pattern.

       Support for script runs is not available if PCRE2 is compiled without
       Unicode support. A compile-time error is given if any of the above
       constructs is encountered. Script runs are not supported by the
       alternative matching function, pcre2_dfa_match(), because they use the
       same mechanism as capturing parentheses.

       Warning:	 The  (*ACCEPT)	 control  verb	(see below) should not be used
       within a	script run group, because it causes an immediate exit from the
       group, bypassing	the script run checking.
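
       The spoofing check that motivates script runs can be roughly
       approximated outside PCRE2. The sketch below is deliberately crude
       and is not the rule set described in pcre2unicode: it assumes Python,
       guesses a script from the first word of each character's Unicode
       name, and ignores non-alphabetic characters. The helper names are
       hypothetical.

```python
import unicodedata


def rough_script(ch):
    """Crude script guess: the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def looks_like_single_script(token):
    """True if all alphabetic characters share one rough script label."""
    scripts = {rough_script(ch) for ch in token if ch.isalpha()}
    return len(scripts) <= 1
```

       A token written entirely in Latin letters passes this check, whereas
       one in which Cyrillic U+0430 replaces the letter "a" does not.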

CONDITIONAL GROUPS

       It is possible to cause the matching process to obey a pattern fragment
       conditionally or	to choose between two alternative fragments, depending
       on the result of	an assertion, or whether a specific capture group  has
       already been matched. The two possible forms of conditional group are:

	 (?(condition)yes-pattern)
	 (?(condition)yes-pattern|no-pattern)

       If  the	condition is satisfied,	the yes-pattern	is used; otherwise the
       no-pattern (if present) is used.	An absent no-pattern is	equivalent  to
       an  empty string	(it always matches). If	there are more than two	alter-
       natives in the group, a compile-time error occurs. Each of the two  al-
       ternatives may itself contain nested groups of any form,	including con-
       ditional	 groups;  the  restriction to two alternatives applies only at
       the level of the	condition itself. This pattern fragment	is an  example
       where the alternatives are complex:

	 (?(1) (A|B|C) | (D | (?(2)E|F)	| E) )

       There are five kinds of condition: references to	capture	groups,	refer-
       ences  to  recursion,  two pseudo-conditions called DEFINE and VERSION,
       and assertions.

   Checking for	a used capture group by	number

       If the text between the parentheses consists of a sequence  of  digits,
       the  condition is true if a capture group of that number	has previously
       matched.	If there is more than one capture group	with the  same	number
       (see  the earlier section about duplicate group numbers), the condition
       is true if any of them have matched. An alternative notation, which  is
       a PCRE2 extension, not supported	by Perl, is to precede the digits with
       a plus or minus sign. In	this case, the group number is relative	rather
       than  absolute.	The most recently opened capture group (which could be
       enclosing this condition) can be	referenced by (?(-1),  the  next  most
       recent by (?(-2), and so	on. Inside loops it can	also make sense	to re-
       fer  to	subsequent groups.  The	next capture group to be opened	can be
       referenced as (?(+1), and so on.	The value zero in any of  these	 forms
       is not used; it provokes	a compile-time error.

       Consider	 the  following	 pattern, which	contains non-significant white
       space to	make it	more readable (assume the PCRE2_EXTENDED  option)  and
       to divide it into three parts for ease of discussion:

	 ( \( )?    [^()]+    (?(1) \) )

       The  first  part	 matches  an optional opening parenthesis, and if that
       character is present, sets it as	the first captured substring. The sec-
       ond part	matches	one or more characters that are	not  parentheses.  The
       third  part  is a conditional group that	tests whether or not the first
       capture group matched. If it did, that is, if the subject started with an
       opening	parenthesis,  the condition is true, and so the	yes-pattern is
       executed	and a closing parenthesis is required.	Otherwise,  since  no-
       pattern is not present, the conditional group matches nothing. In other
       words,  this  pattern matches a sequence	of non-parentheses, optionally
       enclosed	in parentheses.
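
       Conditions that test a group by absolute number are also accepted by
       some other engines, so the behaviour is easy to try. This sketch
       assumes Python's re module (which does not support the relative
       forms) and writes the pattern without the optional white space; the
       names are illustrative.

```python
import re

# Group 1 is the optional "("; (?(1)\)) demands ")" only if group 1 matched.
OPTIONALLY_PARENTHESIZED = re.compile(r"(\()?[^()]+(?(1)\))")


def is_optionally_parenthesized(text):
    """True if text is non-parentheses, optionally wrapped in one pair."""
    return OPTIONALLY_PARENTHESIZED.fullmatch(text) is not None
```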

       If you were embedding this pattern in a larger one,  you	 could	use  a
       relative	reference:

	 ...other stuff... ( \(	)?    [^()]+	(?(-1) \) ) ...

       This  makes  the	 fragment independent of the parentheses in the	larger
       pattern.

   Checking for	a used capture group by	name

       Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test	for  a
       used  capture group by name. For	compatibility with earlier versions of
       PCRE1, which had	this facility before Perl, the syntax (?(name)...)  is
       also  recognized.   Note, however, that undelimited names consisting of
       the letter R followed by	digits are ambiguous (see the  following  sec-
       tion). Rewriting	the above example to use a named group gives this:

	 (?<OPEN> \( )?	   [^()]+    (?(<OPEN>)	\) )

       If  the	name used in a condition of this kind is a duplicate, the test
       is applied to all groups	of the same name, and is true if  any  one  of
       them has	matched.

   Checking for	pattern	recursion

       "Recursion"  in	this sense refers to any subroutine-like call from one
       part of the pattern to another, whether or not it  is  actually	recur-
       sive.  See  the	sections  entitled "Recursive patterns"	and "Groups as
       subroutines" below for details of recursion and subroutine calls.

       If a condition is the string (R), and there is no  capture  group  with
       the  name R, the	condition is true if matching is currently in a	recur-
       sion or subroutine call to the whole pattern or any capture  group.  If
       digits  follow  the letter R, and there is no group with	that name, the
       condition is true if the	most recent call is  into  a  group  with  the
       given  number,  which must exist	somewhere in the overall pattern. This
       is a contrived example that is equivalent to a+b:

	 ((?(R1)a+|(?1)b))

       However,	in both	cases, if there	is a capture  group  with  a  matching
       name,  the  condition tests for its being set, as described in the sec-
       tion above, instead of testing for recursion. For example,  creating  a
       group  with  the	 name  R1  by adding (?<R1>) to	the above pattern com-
       pletely changes its meaning.

       If a name preceded by ampersand follows the letter R, for example:

	 (?(R&name)...)

       the condition is	true if	the most recent	recursion is into a  group  of
       that name (which	must exist within the pattern).

       This condition does not check the entire	recursion stack. It tests only
       the  current  level.  If	the name used in a condition of	this kind is a
       duplicate, the test is applied to all groups of the same	name,  and  is
       true if any one of them is the most recent recursion.

       At "top level", all these recursion test	conditions are false.

   Defining capture groups for use by reference	only

       If the condition	is the string (DEFINE),	the condition is always	false,
       even  if	there is a group with the name DEFINE. In this case, there may
       be only one alternative in the rest of the conditional group. It	is al-
       ways skipped if control reaches this point in the pattern; the idea  of
       DEFINE  is that it can be used to define	subroutines that can be	refer-
       enced from elsewhere. (The use of subroutines is	described below.)  For
       example,	 a  pattern  to	match an IPv4 address such as "192.168.23.245"
       could be	written	like this (ignore white	space and line breaks):

	 (?(DEFINE) (?<byte> 2[0-4]\d |	25[0-5]	| 1\d\d	| [1-9]?\d) )
	 \b (?&byte) (\.(?&byte)){3} \b

       The first part of the pattern is	a DEFINE group	inside	which  another
       group  named "byte" is defined. This matches an individual component of
       an IPv4 address (a number less than 256). When  matching	 takes	place,
       this  part  of  the pattern is skipped because DEFINE acts like a false
       condition. The rest of the pattern uses references to the  named	 group
       to  match the four dot-separated	components of an IPv4 address, insist-
       ing on a	word boundary at each end.
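
       In an engine without DEFINE or subroutine calls, the same reuse can
       be approximated by building the pattern from a shared string. This is
       a sketch assuming Python's re module, not PCRE2 syntax; BYTE, IPV4,
       and is_ipv4 are illustrative names.

```python
import re

# Stand-in for the (?<byte>...) group inside DEFINE: a number below 256.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Each {BYTE} plays the role of a (?&byte) subroutine call.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")


def is_ipv4(text):
    """True if the whole of text is four dot-separated byte values."""
    return IPV4.fullmatch(text) is not None
```

       Textual substitution of this kind only imitates non-recursive calls;
       it cannot express a group that calls itself.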

   Checking the	PCRE2 version

       Programs	that link with a PCRE2 library can check the version by	 call-
       ing  pcre2_config()  with  appropriate arguments. Users of applications
       that do not have	access to the underlying code cannot do	this.  A  spe-
       cial  "condition" called	VERSION	exists to allow	such users to discover
       which version of	PCRE2 they are dealing with by using this condition to
       match a string such as "yesno". VERSION must be followed	either by  "="
       or ">=" and a version number.  For example:

	 (?(VERSION>=10.4)yes|no)

       This pattern matches "yes" if the PCRE2 version is greater than or equal to
       10.4, or	"no" otherwise.	The fractional part of the version number  may
       not contain more	than two digits.

   Assertion conditions

       If  the	condition  is  not  in	any of the above formats, it must be a
       parenthesized assertion.	This may be a positive or  negative  lookahead
       or  lookbehind  assertion. However, it must be a	traditional atomic as-
       sertion,	not one	of the non-atomic assertions.

       Consider	this pattern, again containing	non-significant	 white	space,
       and with	the two	alternatives on	the second line:

	 (?(?=[^a-z]*[a-z])
	 \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

       The  condition  is  a  positive lookahead assertion that	matches	an op-
       tional sequence of non-letters followed by a letter. In other words, it
       tests for the presence of at least one letter in	the subject. If	a let-
       ter is found, the subject is matched  against  the  first  alternative;
       otherwise  it  is  matched  against  the	 second.  This pattern matches
       strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
       letters and dd are digits.
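
       Where assertion conditions are unavailable, the same selection can be
       written as two alternatives guarded by the assertion and by its
       negation. The sketch below assumes Python's re module, which has no
       assertion conditions; it is an equivalent rewrite for this example
       (the condition captures nothing), not PCRE2 syntax, and the names
       CONDITION and DATE are illustrative.

```python
import re

CONDITION = r"[^a-z]*[a-z]"  # "a letter appears somewhere ahead"

DATE = re.compile(
    rf"(?:(?={CONDITION})\d{{2}}-[a-z]{{3}}-\d{{2}}"  # condition true
    rf"|(?!{CONDITION})\d{{2}}-\d{{2}}-\d{{2}})"      # condition false
)
```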

       When an assertion that is a condition contains capture groups, any cap-
       turing  that  occurs  in	 a matching branch is retained afterwards, for
       both positive and negative assertions, because matching always  contin-
       ues  after  the	assertion, whether it succeeds or fails. (Compare non-
       conditional assertions, for which captures are retained only for	 posi-
       tive assertions that succeed.)

COMMENTS

       There are two ways of including comments	in patterns that are processed
       by  PCRE2.  In  both  cases,  the start of the comment must not be in a
       character class,	nor in the middle of any  other	 sequence  of  related
       characters  such	as (?: or a group name or number or a Unicode property
       name. The characters that make up a comment play	no part	in the pattern
       matching.

       The sequence (?#	marks the start	of a comment that continues up to  the
       next  closing parenthesis. Nested parentheses are not permitted.	If the
       PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is	set,  an  unescaped  #
       character  also	introduces  a comment, which in	this case continues to
       immediately after the next newline character or character  sequence  in
       the pattern. Which characters are interpreted as	newlines is controlled
       by  an option passed to the compiling function or by a special sequence
       at the start of the pattern, as described in the	section	entitled "New-
       line conventions" above.	Note that the end of this type of comment is a
       literal newline sequence	in the pattern;	escape sequences  that	happen
       to represent a newline do not count. For	example, consider this pattern
       when  PCRE2_EXTENDED is set, and	the default newline convention (a sin-
       gle linefeed character) is in force:

	 abc #comment \n still comment

       On encountering the # character,	pcre2_compile()	skips  along,  looking
       for  a newline in the pattern. The sequence \n is still literal at this
       stage, so it does not terminate the comment. Only an  actual  character
       with the	code value 0x0a	(the default newline) does so.
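
       The distinction between an escape sequence and a literal newline can
       be observed in any engine that follows the same rule. This sketch
       assumes Python's re module in VERBOSE mode, which treats # comments
       in the same way for this case; it is not PCRE2 itself, and the
       variable names are illustrative.

```python
import re

# A raw string keeps \n as two characters (backslash, n), so the comment
# runs to the end of the pattern and only "abc" is left to match.
one_line = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Here the pattern text contains a real newline (0x0a), which ends the
# comment; "def" on the following line is part of the pattern again.
two_lines = re.compile("abc #comment \n def", re.VERBOSE)
```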

RECURSIVE PATTERNS

       Consider	 the problem of	matching a string in parentheses, allowing for
       unlimited nested	parentheses. Without the use of	 recursion,  the  best
       that  can  be  done  is	to use a pattern that matches up to some fixed
       depth of	nesting. It is not possible to	handle	an  arbitrary  nesting
       depth.

       For some	time, Perl has provided	a facility that	allows regular expres-
       sions  to recurse (amongst other	things). It does this by interpolating
       Perl code in the	expression at run time,	and the	code can refer to  the
       expression itself. A Perl pattern using code interpolation to solve the
       parentheses problem can be created like this:

	 $re = qr{\( (?: (?>[^()]+) | (?p{$re})	)* \)}x;

       The (?p{...}) item interpolates Perl code at run	time, and in this case
       refers recursively to the pattern in which it appears.

       Obviously,  PCRE2  cannot  support  the interpolation of	Perl code. In-
       stead, it supports special syntax for recursion of the entire  pattern,
       and also	for individual capture group recursion.	After its introduction
       in PCRE1	and Python, this kind of recursion was subsequently introduced
       into Perl at release 5.10.

       A  special  item	 that consists of (? followed by a number greater than
       zero and	a closing parenthesis is a recursive subroutine	 call  of  the
       capture	group of the given number, provided that it occurs inside that
       group. (If not, it is a non-recursive subroutine	 call,	which  is  de-
       scribed in the next section.) The special item (?R) or (?0) is a	recur-
       sive call of the	entire regular expression.

       This  PCRE2  pattern  solves the	nested parentheses problem (assume the
       PCRE2_EXTENDED option is	set so that white space	is ignored):

	 \( ( [^()]++ |	(?R) )*	\)

       First it	matches	an opening parenthesis.	Then it	matches	any number  of
       substrings  which can either be a sequence of non-parentheses, or a re-
       cursive match of	the pattern itself (that is, a correctly parenthesized
       substring).  Finally there is a closing parenthesis. Note the use of  a
       possessive  quantifier  to  avoid  backtracking	into sequences of non-
       parentheses.
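
       The control flow of this pattern can be mirrored by a small
       hand-written recursive function. This is a sketch of the idea only,
       written in Python because its re module has no (?R); the function
       name match_parens is illustrative and nothing here is PCRE2 API.

```python
def match_parens(s, i=0):
    """Return the index just past a balanced group starting at s[i].

    Returns None if s[i] does not begin a correctly parenthesized group.
    """
    if i >= len(s) or s[i] != "(":       # \(
        return None
    i += 1
    while i < len(s):
        if s[i] == ")":                  # \)
            return i + 1
        if s[i] == "(":                  # (?R): recurse on a nested group
            i = match_parens(s, i)
            if i is None:
                return None
        else:                            # [^()]++ : skip a non-parenthesis
            i += 1
    return None                          # ran out of input before ")"
```

       Because each character is consumed exactly once, the function has
       the same no-backtracking property that the possessive quantifier
       gives the pattern.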

       If this were part of a larger pattern, you would	not  want  to  recurse
       the entire pattern, so instead you could	use this:

	 ( \( (	[^()]++	| (?1) )* \) )

       We  have	 put the pattern into parentheses, and caused the recursion to
       refer to	them instead of	the whole pattern.

       In a larger pattern,  keeping  track  of	 parenthesis  numbers  can  be
       tricky.	This is	made easier by the use of relative references. Instead
       of (?1) in the pattern above you	can write (?-2)	to refer to the	second
       most recently opened parentheses	 preceding  the	 recursion.  In	 other
       words,  a  negative  number counts capturing parentheses	leftwards from
       the point at which it is	encountered.

       Be aware, however, that if duplicate capture group numbers are in use,
       relative	 references  refer  to the earliest group with the appropriate
       number. Consider, for example:

	 (?|(a)|(b)) (c) (?-2)

       The first two capture groups (a)	and (b)	are both numbered 1, and group
       (c) is number 2.	When the reference (?-2) is  encountered,  the	second
       most  recently opened parentheses has the number	1, but it is the first
       such group (the (a) group) to which the recursion refers. This would be
       the same	if an absolute reference (?1) was used.	In other words,	 rela-
       tive references are just	a shorthand for	computing a group number.

       It  is  also possible to	refer to subsequent capture groups, by writing
       references such as (?+2). However, these	cannot	be  recursive  because
       the  reference  is not inside the parentheses that are referenced. They
       are always non-recursive	subroutine calls, as  described	 in  the  next
       section.

       An  alternative	approach  is to	use named parentheses. The Perl	syntax
       for this	is (?&name); PCRE1's earlier syntax  (?P>name)	is  also  sup-
       ported. We could	rewrite	the above example as follows:

	 (?<pn>	\( ( [^()]++ | (?&pn) )* \) )

       If there	is more	than one group with the	same name, the earliest	one is
       used.

       The example pattern that	we have	been looking at	contains nested	unlim-
       ited  repeats,  and  so the use of a possessive quantifier for matching
       strings of non-parentheses is important when applying  the  pattern  to
       strings that do not match. For example, when this pattern is applied to

	 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

       it  yields  "no	match" quickly.	However, if a possessive quantifier is
       not used, the match runs	for a very long	time indeed because there  are
       so  many	 different  ways the + and * repeats can carve up the subject,
       and all have to be tested before	failure	can be reported.

       At the end of a match, the values of capturing  parentheses  are	 those
       from  the outermost level. If you want to obtain	intermediate values, a
       callout function	can be used (see below and the pcre2callout documenta-
       tion). If the pattern above is matched against

	 (ab(cd)ef)

       the value for the inner capturing parentheses  (numbered	 2)  is	 "ef",
       which  is  the last value taken on at the top level. If a capture group
       is not matched at the top level,	its final  captured  value  is	unset,
       even  if	it was (temporarily) set at a deeper level during the matching
       process.

       Do not confuse the (?R) item with the condition (R),  which  tests  for
       recursion.   Consider  this pattern, which matches text in angle	brack-
       ets, allowing for arbitrary nesting. Only digits	are allowed in	nested
       brackets	 (that is, when	recursing), whereas any	characters are permit-
       ted at the outer	level.

	 < (?: (?(R) \d++  | [^<>]*+) |	(?R)) *	>

       In this pattern,	(?(R) is the start of a	conditional  group,  with  two
       different  alternatives	for the	recursive and non-recursive cases. The
       (?R) item is the	actual recursive call.

   Differences in recursion processing between PCRE2 and Perl

       Some former differences between PCRE2 and Perl no longer	exist.

       Before release 10.30, recursion processing in PCRE2 differed from  Perl
       in  that	 a  recursive  subroutine call was always treated as an	atomic
       group. That is, once it had matched some	of the subject string, it  was
       never  re-entered,  even	if it contained	untried	alternatives and there
       was a subsequent	matching failure. (Historical note:  PCRE  implemented
       recursion before	Perl did.)

       Starting	 with  release 10.30, recursive	subroutine calls are no	longer
       treated as atomic. That is, they	can be re-entered to try unused	alter-
       natives if there	is a matching failure later in the  pattern.  This  is
       now  compatible	with the way Perl works. If you	want a subroutine call
       to be atomic, you must explicitly enclose it in an atomic group.

       Supporting backtracking into recursions simplifies certain types	of re-
       cursive pattern.	For example, this pattern matches palindromic strings:

	 ^((.)(?1)\2|.?)$

       The second branch in the	group matches a	single	central	 character  in
       the  palindrome	when there are an odd number of	characters, or nothing
       when there are an even number of	characters, but	in order  to  work  it
       has  to	be  able  to  try the second case when the rest	of the pattern
       match fails. If you want	to match typical palindromic phrases, the pat-
       tern has	to ignore all non-word characters,  which  can	be  done  like
       this:

	 ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$

       If  run	with  the  PCRE2_CASELESS option, this pattern matches phrases
       such as "A man, a plan, a canal:	Panama!". Note the use of the  posses-
       sive  quantifier	 *+  to	 avoid backtracking into sequences of non-word
       characters. Without this, PCRE2 takes a great deal longer (ten times or
       more) to	match typical phrases, and Perl	takes so long that  you	 think
       it has gone into	a loop.
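
       The recursion in the palindrome patterns corresponds to a familiar
       recursive function. The following is a hand-written Python analogue,
       offered as a sketch for subjects without newlines rather than as a
       description of how PCRE2 evaluates the pattern; both function names
       are illustrative.

```python
import re


def is_palindrome(s):
    # Mirrors ^((.)(?1)\2|.?)$ by peeling off matching outer characters.
    if len(s) <= 1:
        # Second branch: .? matches the middle character or nothing.
        return True
    # First branch: (.) ... \2 with a recursive call on what lies between.
    return s[0] == s[-1] and is_palindrome(s[1:-1])


def is_palindromic_phrase(phrase):
    """Ignore non-word characters and case, as the longer pattern does."""
    return is_palindrome(re.sub(r"\W+", "", phrase).lower())
```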

       Another	way  in	which PCRE2 and	Perl used to differ in their recursion
       processing is in	the handling of	captured  values.  Formerly  in	 Perl,
       when  a	group  was called recursively or as a subroutine (see the next
       section), it had	no access to any values	that were captured outside the
       recursion, whereas in PCRE2 these values	can  be	 referenced.  Consider
       this pattern:

	 ^(.)(\1|a(?2))

       This  pattern matches "bab". The	first capturing	parentheses match "b",
       then in the second group, when the backreference	\1 fails to match "b",
       the second alternative matches "a" and then recurses. In	the recursion,
       \1 does now match "b" and so the	whole match succeeds. This match  used
       to fail in Perl, but in later versions (5.024 was tested) it works.

GROUPS AS SUBROUTINES

       If  the syntax for a recursive group call (either by number or by name)
       is used outside the parentheses to which	it refers, it operates	a  bit
       like  a	subroutine  in	a programming language.	More accurately, PCRE2
       treats the referenced group as an independent subpattern	which it tries
       to match	at the current matching	position. The called group may be  de-
       fined  before  or  after	the reference. A numbered reference can	be ab-
       solute or relative, as in these examples:

	 (...(absolute)...)...(?2)...
	 (...(relative)...)...(?-1)...
	 (...(?+1)...(relative)...

       An earlier example pointed out that the pattern

	 (sens|respons)e and \1ibility

       matches "sense and sensibility" and "response and responsibility",  but
       not "sense and responsibility". If instead the pattern

	 (sens|respons)e and (?1)ibility

       is  used, it does match "sense and responsibility" as well as the other
       two strings. Another example is	given  in  the	discussion  of	DEFINE
       above.
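
       The difference between a backreference and a subroutine call can be
       sketched in Python. This is only an illustration: Python's re module
       has no (?1) syntax, so the call is emulated here by repeating the text
       of group 1. The names GROUP1, backref, subroutine, and matches are
       assumptions of this sketch and are not part of PCRE2.

```python
import re

GROUP1 = r"(sens|respons)"

# \1 must match the same text that group 1 actually matched.
backref = re.compile(GROUP1 + r"e and \1ibility")

# Emulated (?1): the subpattern of group 1 is tried again from scratch,
# so it may match different text the second time.
subroutine = re.compile(GROUP1 + r"e and (?:sens|respons)ibility")


def matches(pattern, subject):
    """Return True if the whole subject matches the compiled pattern."""
    return pattern.fullmatch(subject) is not None


for subject in ("sense and sensibility",
                "response and responsibility",
                "sense and responsibility"):
    print(subject, matches(backref, subject), matches(subroutine, subject))
```

       Only the third subject distinguishes the two forms: the backreference
       rejects it, whereas the emulated subroutine call accepts it.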

       Like  recursions,  subroutine  calls  used to be	treated	as atomic, but
       this changed at PCRE2 release 10.30, so	backtracking  into  subroutine
       calls  can  now	occur. However,	any capturing parentheses that are set
       during the subroutine call revert to their previous values afterwards.

       Processing options such as case-independence are	fixed when a group  is
       defined,	 so  if	 it  is	 used  as a subroutine,	such options cannot be
       changed for different calls. For	example, consider this pattern:

	 (abc)(?i:(?-1))

       It matches "abcabc". It does not	match "abcABC" because the  change  of
       processing option does not affect the called group.

       The  behaviour  of  backtracking	control	verbs in groups	when called as
       subroutines is described	in the section entitled	"Backtracking verbs in
       subroutines" below.

ONIGURUMA SUBROUTINE SYNTAX

       For compatibility with Oniguruma, the non-Perl syntax \g followed by a
       name or a number enclosed either in angle brackets or single quotes is
       an alternative syntax for calling a group as a subroutine, possibly
       recursively. Here are two of the examples used above, rewritten using
       this syntax:

	 (?<pn>	\( ( (?>[^()]+)	| \g<pn> )* \) )
	 (sens|respons)e and \g'1'ibility

       PCRE2 supports an extension to Oniguruma: if a number is	preceded by  a
       plus or a minus sign it is taken	as a relative reference. For example:

	 (abc)(?i:\g<-1>)

       Note  that \g{...} (Perl	syntax)	and \g<...> (Oniguruma syntax) are not
       synonymous. The former is a backreference; the latter is	 a  subroutine
       call.

CALLOUTS

       Perl has	a feature whereby using	the sequence (?{...}) causes arbitrary
       Perl  code to be	obeyed in the middle of	matching a regular expression.
       This makes it possible, amongst other things, to	extract	different sub-
       strings that match the same pair	of parentheses when there is a repeti-
       tion.

       PCRE2 provides a	similar	feature, but of	course it  cannot  obey	 arbi-
       trary  Perl  code. The feature is called	"callout". The caller of PCRE2
       provides	an external function by	putting	its entry  point  in  a	 match
       context	using  the function pcre2_set_callout(), and then passing that
       context to pcre2_match()	or pcre2_dfa_match(). If no match  context  is
       passed,	or  if	the callout entry point	is set to NULL,	callout	points
       will be passed over silently during matching. To	disallow  callouts  in
       the pattern syntax, you may use the PCRE2_EXTRA_NEVER_CALLOUT option.

       Within  a  regular expression, (?C<arg>)	indicates a point at which the
       external	function is to be called. There	 are  two  kinds  of  callout:
       those  with a numerical argument	and those with a string	argument. (?C)
       on its own with no argument is treated as (?C0).	A  numerical  argument
       allows  the  application	 to  distinguish  between  different callouts.
       String arguments	were added for release 10.20 to	make it	 possible  for
       script  languages that use PCRE2	to embed short scripts within patterns
       in a similar way	to Perl.

       During matching,	when PCRE2 reaches a callout point, the	external func-
       tion is called. It is provided with the number or  string  argument  of
       the  callout, the position in the pattern, and one item of data that is
       also set	in the match block. The	callout	function may cause matching to
       proceed,	to backtrack, or to fail.

       By default, PCRE2 implements a  number  of  optimizations  at  matching
       time,  and  one	side-effect is that sometimes callouts are skipped. If
       you need	all possible callouts to happen, you need to set options  that
       disable	the relevant optimizations. More details, including a complete
       description of the programming interface	to the callout	function,  are
       given in	the pcre2callout documentation.

   Callouts with numerical arguments

       If  you	just  want  to	have  a	means of identifying different callout
       points, put a number less than 256 after	the  letter  C.	 For  example,
       this pattern has	two callout points:

	 (?C1)abc(?C2)def

       If  the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical
       callouts	are automatically installed before each	item in	 the  pattern.
       They  are all numbered 255. If there is a conditional group in the pat-
       tern whose condition is an assertion, an	additional callout is inserted
       just before the condition. An explicit callout may also be set at  this
       position, as in this example:

	 (?(?C9)(?=a)abc|def)

       Note that this applies only to assertion	conditions, not	to other types
       of condition.

   Callouts with string	arguments

       A  delimited  string may	be used	instead	of a number as a callout argu-
       ment. The starting delimiter must be one	of ` ' " ^ % #	$  {  and  the
       ending delimiter	is the same as the start, except for {,	where the end-
       ing  delimiter  is  }.  If  the	ending	delimiter is needed within the
       string, it must be doubled. For example:

	 (?C'ab	''c'' d')xyz(?C{any text})pqr

       The doubling is removed before the string  is  passed  to  the  callout
       function.
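
       The delimiter and doubling rules can be modelled with a short Python
       sketch. It is not PCRE2's parser: the function names are hypothetical,
       and the scan for "(?C" is deliberately naive (it ignores escapes and
       character classes), which is an assumption made to keep the example
       small.

```python
# Opening delimiter -> closing delimiter, as listed in the text above.
CLOSERS = {"`": "`", "'": "'", '"': '"', "^": "^",
           "%": "%", "#": "#", "$": "$", "{": "}"}


def read_callout_string(pattern, start):
    """Read one callout string argument.

    `start` is the index of the opening delimiter, just after "(?C".
    Returns (argument, index just past the closing parenthesis).
    """
    opener = pattern[start]
    if opener not in CLOSERS:
        raise ValueError("not a callout string delimiter: %r" % opener)
    closer = CLOSERS[opener]
    chars = []
    i = start + 1
    while i < len(pattern):
        if pattern[i] == closer:
            if pattern[i + 1:i + 2] == closer:
                chars.append(closer)  # doubled delimiter: keep one copy
                i += 2
                continue
            if pattern[i + 1:i + 2] != ")":
                raise ValueError("expected ')' after closing delimiter")
            return "".join(chars), i + 2
        chars.append(pattern[i])
        i += 1
    raise ValueError("unterminated callout string")


def callout_strings(pattern):
    """Collect every string-argument callout in `pattern`, in order."""
    found = []
    i = pattern.find("(?C")
    while i != -1:
        if pattern[i + 3:i + 4] in CLOSERS:
            argument, i = read_callout_string(pattern, i + 3)
            found.append(argument)
        else:
            i += 3  # numerical or empty callout: not handled here
        i = pattern.find("(?C", i)
    return found


print(callout_strings("(?C'ab ''c'' d')xyz(?C{any text})pqr"))
```

       For the example pattern the sketch yields the two strings ab 'c' d and
       any text, with the doubled quotes reduced to single ones.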

BACKTRACKING CONTROL

       There  are  a  number  of  special "Backtracking	Control	Verbs" (to use
       Perl's terminology) that	modify the behaviour  of  backtracking	during
       matching.  They are generally of	the form (*VERB) or (*VERB:NAME). Some
       verbs take either form, and may behave differently depending on whether
       or not a	name argument is present. The names are	 not  required	to  be
       unique within the pattern.

       By  default,  for  compatibility	 with  Perl, a name is any sequence of
       characters that does not	include	a closing parenthesis. The name	is not
       processed in any	way, and it is	not  possible  to  include  a  closing
       parenthesis   in	 the  name.   This  can	 be  changed  by  setting  the
       PCRE2_ALT_VERBNAMES option, but the result is no	 longer	 Perl-compati-
       ble.

       When  PCRE2_ALT_VERBNAMES  is  set,  backslash processing is applied to
       verb names and only an unescaped	 closing  parenthesis  terminates  the
       name.  However, the only	backslash items	that are permitted are \Q, \E,
       and sequences such as \x{100} that define character code	points.	 Char-
       acter type escapes such as \d are faulted.

       A closing parenthesis can be included in	a name either as \) or between
       \Q  and	\E. In addition	to backslash processing, if the	PCRE2_EXTENDED
       or PCRE2_EXTENDED_MORE option is	also set, unescaped whitespace in verb
       names is	skipped, and #-comments	are recognized,	exactly	as in the rest
       of the pattern.	PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do  not	affect
       verb names unless PCRE2_ALT_VERBNAMES is	also set.

       The  maximum  length of a name is 255 in	the 8-bit library and 65535 in
       the 16-bit and 32-bit libraries.	If the name is empty, that is, if  the
       closing	parenthesis immediately	follows	the colon, the effect is as if
       the colon were not there. Any number of these verbs may occur in	a pat-
       tern. Except for	(*ACCEPT), they	may not	be quantified.

       Since these verbs are specifically related  to  backtracking,  most  of
       them  can be used only when the pattern is to be	matched	using the tra-
       ditional	matching function or JIT, because they use backtracking	 algo-
       rithms.	With  the  exception  of (*FAIL), which	behaves	like a failing
       negative	assertion, the backtracking control verbs cause	 an  error  if
       encountered by the DFA matching function.

       The  behaviour  of  these  verbs	in repeated groups, assertions,	and in
       capture groups called as	subroutines (whether or	 not  recursively)  is
       documented below.

   Optimizations that affect backtracking verbs

       PCRE2 contains some optimizations that are used to speed	up matching by
       running some checks at the start	of each	match attempt. For example, it
       may  know  the minimum length of	matching subject, or that a particular
       character must be present. When one of these optimizations bypasses the
       running of a match,  any	 included  backtracking	 verbs	will  not,  of
       course, be processed. You can suppress the start-of-match optimizations
       by  setting  the	PCRE2_NO_START_OPTIMIZE	option when calling pcre2_com-
       pile(), by calling pcre2_set_optimize() with a PCRE2_START_OPTIMIZE_OFF
       directive, or by	starting the pattern with  (*NO_START_OPT).  There  is
       more  discussion	 of  this  option in the section entitled "Compiling a
       pattern"	in the pcre2api	documentation.

       Experiments with Perl suggest that it too has similar optimizations,
       and, as with PCRE2, turning them off can change the result of a match.

   Verbs that act immediately

       The following verbs act as soon as they are encountered.

	  (*ACCEPT) or (*ACCEPT:NAME)

       This  verb causes the match to end successfully,	skipping the remainder
       of the pattern. However,	when it	is inside  a  capture  group  that  is
       called as a subroutine, only that group is ended	successfully. Matching
       then continues at the outer level. If (*ACCEPT) is triggered in a
       positive assertion, the assertion succeeds; in a negative assertion,
       the assertion fails.

       If (*ACCEPT) is inside capturing	parentheses, the data so far  is  cap-
       tured. For example:

	 A((?:A|B(*ACCEPT)|C)D)

       This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
       tured by	the outer parentheses.

       (*ACCEPT) is the	only backtracking verb that is allowed to  be  quanti-
       fied  because  an  ungreedy  quantification with	a minimum of zero acts
       only when a backtrack happens. Consider,	for example,

	 (A(*ACCEPT)??B)C

       where A,	B, and C may be	complex	expressions. After matching  "A",  the
       matcher	processes  "BC"; if that fails,	causing	a backtrack, (*ACCEPT)
       is triggered and	the match succeeds. In both cases, all but C  is  cap-
       tured.  Whereas	(*COMMIT) (see below) means "fail on backtrack", a re-
       peated (*ACCEPT)	of this	type means "succeed on backtrack".

       Warning:	(*ACCEPT) should not be	used within a script  run  group,  be-
       cause  it causes	an immediate exit from the group, bypassing the	script
       run checking.

	 (*FAIL) or (*FAIL:NAME)

       This verb causes	a matching failure, forcing backtracking to occur.  It
       may  be	abbreviated  to	 (*F).	It is equivalent to (?!) but easier to
       read. The Perl documentation notes that it is probably useful only when
       combined	with (?{}) or (??{}). Those are, of course, Perl features that
       are not present in PCRE2. The nearest equivalent	is  the	 callout  fea-
       ture, as	for example in this pattern:

	 a+(?C)(*FAIL)

       A  match	 with the string "aaaa"	always fails, but the callout is taken
       before each backtrack happens (in this example, 10 times).
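
       The figure of 10 can be checked with a small Python model. It assumes
       that no optimization changes how a+ backtracks and that a match
       attempt is made at every starting position; count_callouts is a
       hypothetical helper, not a PCRE2 function.

```python
def count_callouts(subject):
    """Count how often the callout in a+(?C)(*FAIL) would be reached."""
    calls = 0
    for start in range(len(subject)):
        # Length of the run of "a" characters at this starting position.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ first takes the whole run, then gives back one character per
        # backtrack, so the callout is reached once for each length from
        # `run` down to 1.
        calls += run
    return calls


print(count_callouts("aaaa"))  # 4 + 3 + 2 + 1 = 10
```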

       (*ACCEPT:NAME) and (*FAIL:NAME) behave the  same	 as  (*MARK:NAME)(*AC-
       CEPT)  and  (*MARK:NAME)(*FAIL),	 respectively,	that  is, a (*MARK) is
       recorded	just before the	verb acts.

   Recording which path	was taken

       There is	one verb whose main purpose is to track	how a  match  was  ar-
       rived  at,  though  it also has a secondary use in conjunction with ad-
       vancing the match starting point	(see (*SKIP) below).

	 (*MARK:NAME) or (*:NAME)

       A name is always	required with this verb. For all the other  backtrack-
       ing control verbs, a NAME argument is optional.

       When a match succeeds, the last-encountered mark name on the matching
       path is passed back to the caller as described in the section entitled
       "Other information about the match" in the pcre2api documentation.
       This applies to all instances of (*MARK) and other verbs,
       including those inside assertions and atomic groups. However, there are
       differences  in	those  cases  when (*MARK) is used in conjunction with
       (*SKIP) as described below.

       The mark	name that was last encountered on the matching path is	passed
       back.  A	verb without a NAME argument is	ignored	for this purpose. Here
       is an example of	pcre2test output, where	the "mark"  modifier  requests
       the retrieval and outputting of (*MARK) data:

	   re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
	 data> XY
	  0: XY
	 MK: A
	 data> XZ
	  0: XZ
	 MK: B

       The (*MARK) name	is tagged with "MK:" in	this output, and in this exam-
       ple  it indicates which of the two alternatives matched.	This is	a more
       efficient way of	obtaining this information than	putting	each  alterna-
       tive in its own capturing parentheses.
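
       For comparison, the capturing-parentheses approach mentioned above can
       be sketched in Python, whose re module has no (*MARK). Here each
       alternative is given its own named group and Match.lastgroup reports
       which one took part in the match. The group names A and B and the
       helper which_alternative are assumptions of this sketch.

```python
import re

# Each alternative carries its own named group; Match.lastgroup is the
# name of the last group that took part in the match.
pattern = re.compile(r"X(?P<A>Y)|X(?P<B>Z)")


def which_alternative(subject):
    """Return "A" or "B" for a matching subject, otherwise None."""
    match = pattern.fullmatch(subject)
    return match.lastgroup if match else None


for subject in ("XY", "XZ", "XP"):
    print(subject, which_alternative(subject))
```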

       If  a  verb  with a name	is encountered in a positive assertion that is
       true, the name is recorded and passed back if it	 is  the  last-encoun-
       tered. This does	not happen for negative	assertions or failing positive
       assertions.

       After  a	 partial match or a failed match, the last encountered name in
       the entire match	process	is returned. For example:

	   re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
	 data> XP
	 No match, mark	= B

       Note that in this unanchored example the	 mark  is  retained  from  the
       match attempt that started at the letter	"X" in the subject. Subsequent
       match attempts starting at "P" and then with an empty string do not get
       as far as the (*MARK) item, but nevertheless do not reset it.

       If  you	are  interested	 in  (*MARK)  values after failed matches, you
       should probably either set the PCRE2_NO_START_OPTIMIZE option  or  call
       pcre2_set_optimize()  with  a  PCRE2_START_OPTIMIZE_OFF	directive (see
       above) to ensure	that the match is always attempted.

   Verbs that act after	backtracking

       The following verbs do nothing when they	are encountered. Matching con-
       tinues with what	follows, but if	there is a subsequent  match  failure,
       causing	a  backtrack  to the verb, a failure is	forced.	That is, back-
       tracking	cannot pass to the left	of the	verb.  However,	 when  one  of
       these  verbs  appears inside an atomic group or in an atomic lookaround
       assertion that is true, its effect is confined to that  group,  because
       once  the  group	has been matched, there	is never any backtracking into
       it. Backtracking	from beyond an atomic assertion	or group  ignores  the
       entire group, and seeks a preceding backtracking	point.

       These  verbs  differ  in	exactly	what kind of failure occurs when back-
       tracking	reaches	them. The behaviour described below  is	 what  happens
       when  the  verb is not in a subroutine or an assertion. Subsequent sec-
       tions cover these special cases.

	 (*COMMIT) or (*COMMIT:NAME)

       This verb causes	the whole match	to fail	outright if there is  a	 later
       matching	failure	that causes backtracking to reach it. Even if the pat-
       tern  is	 unanchored,  no further attempts to find a match by advancing
       the starting point take place. If (*COMMIT) is  the  only  backtracking
       verb that is encountered, once it has been passed pcre2_match() is com-
       mitted to finding a match at the	current	starting point,	or not at all.
       For example:

	 a+(*COMMIT)b

       This  matches  "xxaab" but not "aacaab".	It can be thought of as	a kind
       of dynamic anchor, or "I've started, so I must finish."

       The behaviour of	(*COMMIT:NAME) is not the same	as  (*MARK:NAME)(*COM-
       MIT).  It is like (*MARK:NAME) in that the name is remembered for pass-
       ing back	to the caller. However,	(*SKIP:NAME) searches only  for	 names
       that are	set with (*MARK), ignoring those set by	any of the other back-
       tracking	verbs.

       If  there  is more than one backtracking	verb in	a pattern, a different
       one that	follows	(*COMMIT) may be triggered first,  so  merely  passing
       (*COMMIT) during	a match	does not always	guarantee that a match must be
       at this starting	point.

       Note that (*COMMIT) at the start	of a pattern is	not the	same as	an an-
       chor,  unless  PCRE2's  start-of-match optimizations are	turned off, as
       shown in	this output from pcre2test:

	   re> /(*COMMIT)abc/
	 data> xyzabc
	  0: abc
	 data>
	   re> /(*COMMIT)abc/no_start_optimize
	 data> xyzabc
	 No match

       For the first pattern, PCRE2 knows that any match must start with  "a",
       so  the optimization skips along	the subject to "a" before applying the
       pattern to the first set	of data. The match attempt then	succeeds.  The
       second  pattern disables	the optimization that skips along to the first
       character. The pattern is now applied  starting	at  "x",  and  so  the
       (*COMMIT)  causes  the  match to	fail without trying any	other starting
       points.

	 (*PRUNE) or (*PRUNE:NAME)

       This verb causes	the match to fail at the current starting position  in
       the subject if there is a later matching	failure	that causes backtrack-
       ing  to	reach it. If the pattern is unanchored,	the normal "bumpalong"
       advance to the next starting character then happens.  Backtracking  can
       occur  as  usual	to the left of (*PRUNE), before	it is reached, or when
       matching	to the right of	(*PRUNE), but if there	is  no	match  to  the
       right,  backtracking cannot cross (*PRUNE). In simple cases, the	use of
       (*PRUNE)	is just	an alternative to an atomic group or possessive	 quan-
       tifier, but there are some uses of (*PRUNE) that	cannot be expressed in
       any  other  way.	In an anchored pattern (*PRUNE)	has the	same effect as
       (*COMMIT).

       The behaviour of	(*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
       It is like (*MARK:NAME) in that the name	is remembered for passing back
       to the caller. However, (*SKIP:NAME) searches only for names  set  with
       (*MARK),	ignoring those set by other backtracking verbs.

	 (*SKIP)

       This  verb, when	given without a	name, is like (*PRUNE),	except that if
       the pattern is unanchored, the "bumpalong" advance is not to  the  next
       character, but to the position in the subject where (*SKIP) was encoun-
       tered.  (*SKIP)	signifies that whatever	text was matched leading up to
       it cannot be part of a successful match if there	is a  later  mismatch.
       Consider:

	 a+(*SKIP)b

       If  the	subject	 is  "aaaac...",  after	 the first match attempt fails
       (starting at the	first character	in the	string),  the  starting	 point
       skips on	to start the next attempt at "c". Note that a possessive quan-
       tifier does not have the	same effect as this example; although it would
       suppress	 backtracking  during  the first match attempt,	the second at-
       tempt would start at the	second character instead  of  skipping	on  to
       "c".

       If  (*SKIP) is used to specify a	new starting position that is the same
       as the starting position	of the current match, or (by  being  inside  a
       lookbehind)  earlier, the position specified by (*SKIP) is ignored, and
       instead the normal "bumpalong" occurs.

	 (*SKIP:NAME)

       When (*SKIP) has	an associated name, its	behaviour  is  modified.  When
       such  a	(*SKIP)	is triggered, the previous path	through	the pattern is
       searched	for the	most recent (*MARK) that has the same name. If one  is
       found,  the  "bumpalong"	advance	is to the subject position that	corre-
       sponds to that (*MARK) instead of to where (*SKIP) was encountered.  If
       no (*MARK) with a matching name is found, the (*SKIP) is	ignored.

       The  search  for	a (*MARK) name uses the	normal backtracking mechanism,
       which means that	it does	not  see  (*MARK)  settings  that  are	inside
       atomic groups or	assertions, because they are never re-entered by back-
       tracking. Compare the following pcre2test examples:

	   re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
	 data> abc
	  0: a
	  1: a
	 data>
	   re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
	 data> abc
	  0: b
	  1: b

       In  the first example, the (*MARK) setting is in	an atomic group, so it
       is not seen when	(*SKIP:X) triggers, causing the	(*SKIP)	to be ignored.
       This allows the second branch of	the pattern to be tried	at  the	 first
       character  position.  In	the second example, the	(*MARK)	setting	is not
       in an atomic group. This	allows (*SKIP:X) to find the (*MARK)  when  it
       backtracks, and this causes a new matching attempt to start at the sec-
       ond  character.	This  time, the	(*MARK)	is never seen because "a" does
       not match "b", so the matcher immediately jumps to the second branch of
       the pattern.

       Note that (*SKIP:NAME) searches only for	names set by (*MARK:NAME).  It
       ignores names that are set by other backtracking	verbs.

	 (*THEN) or (*THEN:NAME)

       This  verb  causes  a skip to the next innermost	alternative when back-
       tracking	reaches	it. That  is,  it  cancels  any	 further  backtracking
       within  the  current  alternative.  Its name comes from the observation
       that it can be used for a pattern-based if-then-else block:

	 ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...

       If the COND1 pattern matches, FOO is tried (and possibly	further	 items
       after  the  end	of the group if	FOO succeeds); on failure, the matcher
       skips to	the second alternative and tries COND2,	 without  backtracking
       into  COND1.  If	that succeeds and BAR fails, COND3 is tried. If	subse-
       quently BAZ fails, there	are no more alternatives, so there is a	 back-
       track  to  whatever came	before the entire group. If (*THEN) is not in-
       side an alternation, it acts like (*PRUNE).
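
       The if-then-else reading can be modelled in Python. This is a rough
       sketch under stated assumptions: each branch is split into a condition
       and a body at the point where (*THEN) would stand, Python's re is used
       for the pieces, and anything following the group is ignored. The
       function then_alternation is hypothetical.

```python
import re


def then_alternation(branches, subject, pos=0):
    """Model ( COND1 (*THEN) BODY1 | COND2 (*THEN) BODY2 | ... ).

    `branches` is a list of (cond, body) pattern strings. Returns
    (branch index, end position) for the first branch that succeeds,
    or None if every branch fails.
    """
    for index, (cond, body) in enumerate(branches):
        cond_match = re.compile(cond).match(subject, pos)
        if cond_match is None:
            continue
        body_match = re.compile(body).match(subject, cond_match.end())
        if body_match is not None:
            return index, body_match.end()
        # The body failed. (*THEN) forbids backtracking into cond, so the
        # condition is not retried with a shorter match; the next
        # alternative is tried instead.
    return None


branches = [("a+", "ab"), ("a", "ab")]
print(then_alternation(branches, "aab"))

# Without (*THEN), a+ may give back a character, so branch 1 succeeds.
plain = re.match(r"(a+ab)|(aab)", "aab")
print(plain.lastindex)
```

       With the subject "aab" the model reports the second branch (index 1),
       because a+ keeps both "a" characters and the body then fails. The
       plain alternation succeeds in its first branch by backtracking into
       a+.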

       The behaviour of	(*THEN:NAME) is	not the	same  as  (*MARK:NAME)(*THEN).
       It is like (*MARK:NAME) in that the name	is remembered for passing back
       to  the	caller.	However, (*SKIP:NAME) searches only for	names set with
       (*MARK),	ignoring those set by other backtracking verbs.

       A group that does not contain a | character is just a part of  the  en-
       closing	alternative;  it is not	a nested alternation with only one al-
       ternative. The effect of	(*THEN)	extends	beyond such a group to the en-
       closing alternative.  Consider this pattern, where A, B,	etc. are  com-
       plex  pattern  fragments	 that  do not contain any | characters at this
       level:

	 A (B(*THEN)C) | D

       If A and	B are matched, but there is a failure in C, matching does  not
       backtrack into A; instead it moves to the next alternative, that	is, D.
       However,	 if  the  group	containing (*THEN) is given an alternative, it
       behaves differently:

	 A (B(*THEN)C |	(*FAIL)) | D

       The effect of (*THEN) is	now confined to	the inner group. After a fail-
       ure in C, matching moves	to (*FAIL), which causes the  whole  group  to
       fail  because  there  are  no  more  alternatives to try. In this case,
       matching	does backtrack into A.

       Note that a conditional group is	not considered as having two  alterna-
       tives,  because	only one is ever used. In other	words, the | character
       in a conditional	group has a different meaning. Ignoring	 white	space,
       consider:

	 ^.*? (?(?=a) a	| b(*THEN)c )

       If the subject is "ba", this pattern does not match. Because .*?	is un-
       greedy,	it initially matches zero characters. The condition (?=a) then
       fails, the character "b"	is matched, but	"c" is	not.  At  this	point,
       matching	 does  not  backtrack to .*? as	might perhaps be expected from
       the presence of the | character.	The conditional	group is part  of  the
       single  alternative  that comprises the whole pattern, and so the match
       fails. (If there were a backtrack into .*?, allowing it to match "b",
       the match would succeed.)

       The  verbs just described provide four different	"strengths" of control
       when subsequent matching	fails. (*THEN) is the weakest, carrying	on the
       match at	the next alternative. (*PRUNE) comes next, failing  the	 match
       at  the	current	starting position, but allowing	an advance to the next
       character (for an unanchored pattern). (*SKIP) is similar, except  that
       the advance may be more than one	character. (*COMMIT) is	the strongest,
       causing the entire match	to fail.
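
       The effect of (*PRUNE), (*SKIP), and (*COMMIT) on the "bumpalong"
       advance can be illustrated with a Python model of the single pattern
       a+(*VERB)b. It is a sketch, not PCRE2's algorithm: the function search
       is hypothetical, start-of-match optimizations are assumed to be off,
       and (*THEN) is not modelled because the pattern has no alternation.

```python
def search(subject, verb=None):
    """Model an unanchored match of a+(*VERB)b.

    `verb` is None, "PRUNE", "SKIP", or "COMMIT". Returns (span, starts):
    span is (start, end) of the match or None, and starts lists every
    position at which a match attempt was made.
    """
    starts = []
    start = 0
    while start <= len(subject):
        starts.append(start)
        # Greedy a+: consume as many "a" characters as possible.
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        next_start = start + 1  # the normal bumpalong
        if end > start:
            # The verb is passed here, at subject position `end`.
            if subject.startswith("b", end):
                return (start, end + 1), starts
            # "b" failed, so matching backtracks onto the verb.
            if verb == "COMMIT":
                return None, starts  # no further starting points
            if verb == "SKIP":
                next_start = end  # restart where (*SKIP) was passed
            # With PRUNE the attempt fails at once. With no verb, a+
            # gives back characters, but "b" is then compared with an
            # "a" and fails too. Both cases use the normal bumpalong.
        start = next_start
    return None, starts


for verb in (None, "PRUNE", "SKIP", "COMMIT"):
    print(verb, search("aacaab", verb))
```

       For the subject "aacaab" the model tries starting positions 0, 1, 2, 3
       with (*PRUNE) or with no verb, positions 0, 2, 3 with (*SKIP), and
       only position 0 with (*COMMIT), which therefore finds no match.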

   More	than one backtracking verb

       If  more	 than  one  backtracking verb is present in a pattern, the one
       that is backtracked onto	first acts. For	example,  consider  this  pat-
       tern, where A, B, etc. are complex pattern fragments:

	 (A(*COMMIT)B(*THEN)C|ABD)

       If  A matches but B fails, the backtrack	to (*COMMIT) causes the	entire
       match to	fail. However, if A and	B match, but C fails, the backtrack to
       (*THEN) causes the next alternative (ABD) to be tried.  This  behaviour
       is  consistent,	but is not always the same as Perl's. It means that if
       two or more backtracking verbs appear in succession, all but the last
       of them have no effect. Consider this example:

	 ...(*COMMIT)(*PRUNE)...

       If there	is a matching failure to the right, backtracking onto (*PRUNE)
       causes  it to be	triggered, and its action is taken. There can never be
       a backtrack onto	(*COMMIT).

   Backtracking	verbs in repeated groups

       PCRE2 sometimes differs from Perl in its	handling of backtracking verbs
       in repeated groups. For example,	consider:

	 /(a(*COMMIT)b)+ac/

       If the subject is "abac", Perl matches  unless  its  optimizations  are
       disabled,  but  PCRE2  always fails because the (*COMMIT) in the	second
       repeat of the group acts.

   Backtracking	verbs in assertions

       (*FAIL) in any assertion	has its	normal effect: it forces an  immediate
       backtrack. The behaviour of the other backtracking verbs depends on
       whether the assertion is standalone or acting as the condition in a
       conditional group.

       (*ACCEPT)  in  a	 standalone positive assertion causes the assertion to
       succeed without any further processing; captured	 strings  and  a  mark
       name  (if  set) are retained. In	a standalone negative assertion, (*AC-
       CEPT) causes the	assertion to fail without any further processing; cap-
       tured substrings	and any	mark name are discarded.

       If the assertion	is a condition,	(*ACCEPT) causes the condition	to  be
       true  for  a  positive assertion	and false for a	negative one; captured
       substrings are retained in both cases.

       The remaining verbs act only when a later failure causes	a backtrack to
       reach them. This	means that, for	the Perl-compatible assertions,	 their
       effect is confined to the assertion, because Perl lookaround assertions
       are atomic. A backtrack that occurs after such an assertion is complete
       does  not  jump	back  into  the	 assertion.  Note in particular	that a
       (*MARK) name that is set	in an assertion	is not "seen" by  an  instance
       of (*SKIP:NAME) later in	the pattern.

       PCRE2  now  supports non-atomic positive	assertions and also "scan sub-
       string" assertions, as described	in the sections	 entitled  "Non-atomic
       assertions"  and	 "Scan	substring  assertions" above. These assertions
       must be standalone (not used as conditions). They are not Perl-compati-
       ble. For	these assertions, a later backtrack does jump  back  into  the
       assertion,  and	therefore  verbs such as (*COMMIT) can be triggered by
       backtracks from later in	the pattern.

       The effect of (*THEN) is	not allowed to escape beyond an	assertion.  If
       there  are no more branches to try, (*THEN) causes a positive assertion
       to be false, and	a negative assertion to	be true. This  behaviour  dif-
       fers from Perl when the assertion has only one branch.

       The  other  backtracking	verbs are not treated specially	if they	appear
       in a standalone positive	assertion. In a	 conditional  positive	asser-
       tion, backtracking (from	within the assertion) into (*COMMIT), (*SKIP),
       or  (*PRUNE) causes the condition to be false. However, for both	stand-
       alone and conditional negative assertions, backtracking into (*COMMIT),
       (*SKIP),	or (*PRUNE) causes the assertion to be true, without consider-
       ing any further alternative branches.

   Backtracking	verbs in subroutines

       These behaviours	occur whether or not the group is called recursively.

       (*ACCEPT) in a group called as a	subroutine causes the subroutine match
       to succeed without any further processing. Matching then	continues  af-
       ter  the	 subroutine call. Perl documents this behaviour. Perl's	treat-
       ment of the other verbs in subroutines is different in some cases.

       (*FAIL) in a group called as a subroutine has  its  normal  effect:  it
       forces an immediate backtrack.

       (*COMMIT),  (*SKIP),  and  (*PRUNE)  cause the subroutine match to fail
       when triggered by being backtracked to in a group called	as  a  subrou-
       tine. There is then a backtrack at the outer level.

       (*THEN),	when triggered,	skips to the next alternative in the innermost
       enclosing  group	that has alternatives (its normal behaviour). However,
       if there	is no such group within	the subroutine's group,	the subroutine
       match fails and there is	a backtrack at the outer level.

EBCDIC ENVIRONMENTS

       Differences in the way PCRE2 behaves when it is running in an EBCDIC
       environment are covered in this section.

   Escape sequences

       When PCRE2 is compiled in EBCDIC	mode, \N{U+hhh..}  is  not  supported.
       \a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values.
       The \c escape is	processed as specified for Perl	in the perlebcdic doc-
       ument.  The  only characters that are allowed after \c are A-Z, a-z, or
       one of @, [, \, ], ^, _,	or ?. Any other	character provokes a  compile-
       time  error.  The  sequence  \c@	encodes	character code 0; after	\c the
       letters (in either case)	encode characters 1-26 (hex 01 to hex 1A);  [,
       \,  ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c? be-
       comes either 255	(hex FF) or 95 (hex 5F).

       Thus, apart from	\c?, these escapes generate the	 same  character  code
       values  as they do in an	ASCII or Unicode environment, though the mean-
       ings of the values mostly differ. For  example,	\cG  always  generates
       code value 7, which is BEL in ASCII but DEL in EBCDIC.

       The  sequence  \c? generates DEL	(127, hex 7F) in an ASCII environment,
       but because 127 is not a	control	character in  EBCDIC,  Perl  makes  it
       generate	 the  APC character. Unfortunately, there are several variants
       of EBCDIC. In most of them the APC character has	 the  value  255  (hex
       FF),  but  in  the one Perl calls POSIX-BC its value is 95 (hex 5F). If
       certain other characters	have POSIX-BC values, PCRE2 makes \c? generate
       95; otherwise it	generates 255.
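
       The \c mapping described above can be written as a small Python
       function. It is a sketch: ebcdic_control_escape and its posix_bc
       parameter are hypothetical (PCRE2 makes that choice when it is built),
       and the ASCII ordinals that Python provides are used only as a
       convenient way to compute the values 0 to 31.

```python
import string

# Characters that are allowed after \c in EBCDIC mode.
ALLOWED = "@" + string.ascii_letters + "[\\]^_?"


def ebcdic_control_escape(char, posix_bc=False):
    """Return the code value generated by \\c followed by `char`."""
    if len(char) != 1 or char not in ALLOWED:
        raise ValueError("character not allowed after \\c: %r" % char)
    if char == "?":
        # APC: 95 (hex 5F) in POSIX-BC, 255 (hex FF) in other variants.
        return 0x5F if posix_bc else 0xFF
    # @ gives 0, letters give 1-26, and [ \ ] ^ _ give 27-31.
    return ord(char.upper()) ^ 0x40


print(ebcdic_control_escape("G"))  # 7
print(ebcdic_control_escape("?"))  # 255
```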

   Character classes

       In character classes there is a special case in EBCDIC environments for
       ranges whose end	points are both	specified as literal  letters  in  the
       same  case.  For	compatibility with Perl, EBCDIC	code points within the
       range that are not letters are omitted. For example, [h-k] matches only
       four characters,	even though the	EBCDIC codes for h and k are 0x88  and
       0x92, a range of	11 code	points.	However, if the	range is specified nu-
       merically,  for	example,  [\x88-\x92] or [h-\x92], all code points are
       included.
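
       The [h-k] example can be reproduced with Python's cp037 codec, one
       common EBCDIC code page. The choice of cp037 and the helper
       ebcdic_range are assumptions of this sketch; an EBCDIC build of PCRE2
       may use a different variant.

```python
import string


def ebcdic_range(first, last, codec="cp037"):
    """Return the EBCDIC end points and contents of the range first-last."""
    lo = first.encode(codec)[0]
    hi = last.encode(codec)[0]
    # Every code point between the end points, decoded back to text.
    all_chars = [bytes([code]).decode(codec) for code in range(lo, hi + 1)]
    # Only the unaccented letters; this models what [h-k] keeps.
    letters = [c for c in all_chars if c in string.ascii_letters]
    return lo, hi, all_chars, letters


lo, hi, all_chars, letters = ebcdic_range("h", "k")
print(hex(lo), hex(hi), len(all_chars), letters)
```

       The range spans 11 code points from 0x88 to 0x92, of which only h, i,
       j, and k are letters of the kind that the literal range retains.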

SEE ALSO

       pcre2api(3),   pcre2callout(3),	  pcre2matching(3),    pcre2syntax(3),
       pcre2(3).

AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.

REVISION

       Last updated: 27	November 2024
       Copyright (c) 1997-2024 University of Cambridge.

PCRE2 10.45		       27 November 2024		       PCRE2PATTERN(3)
