FreeBSD Manual Pages

home | help
re(3)			   Erlang Module Definition			 re(3)

NAME
       re - Perl-like regular expressions for Erlang.

DESCRIPTION
       This  module contains regular expression	matching functions for strings
       and binaries.

       The regular expression syntax and semantics resemble that of Perl.

       The matching algorithms of the library are based	on the	PCRE  library,
       but not all of the PCRE library is interfaced and some parts of the li-
       brary  go beyond	what PCRE offers. Currently PCRE version 8.40 (release
       date 2017-01-11)	is used. The sections of the PCRE  documentation  that
       are relevant to this module are included	here.

   Note:
       The  Erlang literal syntax for strings uses the "\" (backslash) charac-
       ter as an escape	code.  You  need  to  escape  backslashes  in  literal
       strings,	 both  in your code and	in the shell, with an extra backslash,
       that is,	"\\".

DATA TYPES
       mp() = {re_pattern, term(), term(), term(), term()}

	      Opaque data type containing a compiled regular expression.  mp()
	      is  guaranteed to	be a tuple() having the	atom re_pattern	as its
	      first element, to	allow for matching in guards. The arity	of the
	      tuple or the content of the other	fields can  change  in	future
	      Erlang/OTP releases.

       nl_spec() = cr |	crlf | lf | anycrlf | any

       compile_option()	=
	   unicode | anchored |	caseless | dollar_endonly | dotall |
	   extended | firstline	| multiline | no_auto_capture |
	   dupnames | ungreedy |
	   {newline, nl_spec()}	|
	   bsr_anycrlf | bsr_unicode | no_start_optimize | ucp |
	   never_utf

       replace_fun() =
	   fun((binary(), [binary()]) -> iodata() | unicode:charlist())

EXPORTS
       version() -> binary()

	      The return of this function is a string with the PCRE version of
	      the system that was used in the Erlang/OTP compilation.

       compile(Regexp) -> {ok, MP} | {error, ErrSpec}

	      Types:

		 Regexp	= iodata()
		 MP = mp()
		 ErrSpec =
		     {ErrString	:: string(), Position :: integer() >= 0}

	      The same as compile(Regexp,[])

       compile(Regexp, Options)	-> {ok,	MP} | {error, ErrSpec}

	      Types:

		 Regexp	= iodata() | unicode:charlist()
		 Options = [Option]
		 Option	= compile_option()
		 MP = mp()
		 ErrSpec =
		     {ErrString	:: string(), Position :: integer() >= 0}

	      Compiles	a regular expression, with the syntax described	below,
	      into an internal format to be used later as a parameter to run/2
	      and run/3.

	      Compiling	the regular expression before matching	is  useful  if
	      the  same	 expression is to be used in matching against multiple
	      subjects during the lifetime of the program. Compiling once  and
	      executing	 many  times is	far more efficient than	compiling each
	      time one wants to	match.

	      When option unicode is specified,	the regular expression	is  to
	      be  specified  as	 a  valid Unicode charlist(), otherwise	as any
	      valid iodata().

	      Options:

		unicode:
		  The regular expression is specified as a Unicode  charlist()
		  and  the  resulting  regular	expression  code  is to	be run
		  against a valid Unicode charlist()  subject.	Also  consider
		  option ucp when using	Unicode	characters.

		anchored:
		  The  pattern is forced to be "anchored", that	is, it is con-
		  strained to match only at the	first matching	point  in  the
		  string  that is searched (the	"subject string"). This	effect
		  can also be achieved by appropriate constructs in  the  pat-
		  tern itself.

		caseless:
		  Letters  in  the  pattern match both uppercase and lowercase
		  letters. It is equivalent to	Perl  option  /i  and  can  be
		  changed within a pattern by a	(?i) option setting. Uppercase
		  and lowercase	letters	are defined as in the ISO 8859-1 char-
		  acter	set.

		dollar_endonly:
		  A  dollar  metacharacter  in the pattern matches only	at the
		  end of the subject string. Without  this  option,  a	dollar
		  also	matches	immediately before a newline at	the end	of the
		  string (but not before any other newlines). This  option  is
		  ignored if option multiline is specified. There is no	equiv-
		  alent	option in Perl,	and it cannot be set within a pattern.

		dotall:
		  A dot	in the pattern matches all characters, including those
		  indicating  newline.	Without	 it, a dot does	not match when
		  the current position is at a newline.	This option is equiva-
		  lent to Perl option /s and it	can be changed within  a  pat-
		  tern	by  a  (?s)  option setting. A negative	class, such as
		  [^a],	always matches newline characters, independent of  the
		  setting of this option.

		extended:
		  If  this  option  is set, most white space characters	in the
		  pattern are totally ignored except when escaped or inside  a
		  character  class. However, white space is not	allowed	within
		  sequences such as (?>	that introduce	various	 parenthesized
		  subpatterns,	nor  within  a	numerical  quantifier  such as
		  {1,3}. However, ignorable white space	is  permitted  between
		  an  item and a following quantifier and between a quantifier
		  and a	following + that indicates possessiveness.

		  White	space did not used to include the VT  character	 (code
		  11),	because	 Perl  did  not	 treat this character as white
		  space. However, Perl changed at release 5.18,	so  PCRE  fol-
		  lowed	at release 8.34, and VT	is now treated as white	space.

		  This also causes characters between an unescaped # outside a
		  character  class  and	the next newline, inclusive, to	be ig-
		  nored. This is equivalent to Perl's /x option, and it	can be
		  changed within a pattern by a	(?x) option setting.

		  With this option, comments inside complicated	 patterns  can
		  be  included.	However, notice	that this applies only to data
		  characters. Whitespace characters can	 never	appear	within
		  special character sequences in a pattern, for	example	within
		  sequence (?( that introduces a conditional subpattern.

		firstline:
		  An  unanchored pattern is required to	match before or	at the
		  first	newline	in the subject string,	although  the  matched
		  text can continue over the newline.

		multiline:
		  By  default, PCRE treats the subject string as consisting of
		  a single line	of characters (even if it contains  newlines).
		  The  "start  of  line" metacharacter (^) matches only	at the
		  start	of the string, while the "end of  line"	 metacharacter
		  ($)  matches only at the end of the string, or before	a ter-
		  minating newline (unless  option  dollar_endonly  is	speci-
		  fied). This is the same as in	Perl.

		  When	this option is specified, the "start of	line" and "end
		  of line" constructs match immediately	following  or  immedi-
		  ately	 before	 internal  newlines in the subject string, re-
		  spectively, as well as at the	very start and	end.  This  is
		  equivalent  to  Perl	option	/m and can be changed within a
		  pattern by a (?m) option setting. If there are  no  newlines
		  in  a	 subject string, or no occurrences of ^	or $ in	a pat-
		  tern,	setting	multiline has no effect.

		no_auto_capture:
		  Disables the use of numbered capturing  parentheses  in  the
		  pattern.  Any	 opening parenthesis that is not followed by ?
		  behaves as if	it is followed by ?:.  Named  parentheses  can
		  still	be used	for capturing (and they	acquire	numbers	in the
		  usual	way). There is no equivalent option in Perl.

		dupnames:
		  Names	 used  to  identify  capturing subpatterns need	not be
		  unique. This can be helpful for  certain  types  of  pattern
		  when it is known that	only one instance of the named subpat-
		  tern	can ever be matched. More details of named subpatterns
		  are provided below.

		ungreedy:
		  Inverts the "greediness" of the quantifiers so that they are
		  not greedy by	default, but become greedy if followed by "?".
		  It is	not compatible with Perl. It can also be set by	a (?U)
		  option setting within	the pattern.

		{newline, NLSpec}:
		  Overrides the	default	definition of a	newline	in the subject
		  string, which	is LF (ASCII 10) in Erlang.

		  cr:
		    Newline is indicated by a single character cr (ASCII 13).

		  lf:
		    Newline is indicated by a single character LF (ASCII  10),
		    the	default.

		  crlf:
		    Newline  is	 indicated by the two-character	CRLF (ASCII 13
		    followed by	ASCII 10) sequence.

		  anycrlf:
		    Any	of the three preceding sequences is to be recognized.

		  any:
		    Any	of the newline sequences above,	and  the  Unicode  se-
		    quences  VT	(vertical tab, U+000B),	FF (formfeed, U+000C),
		    NEL	(next line, U+0085), LS	(line separator, U+2028),  and
		    PS (paragraph separator, U+2029).

		bsr_anycrlf:
		  Specifies  specifically that \R is to	match only the CR, LF,
		  or CRLF sequences, not the Unicode-specific newline  charac-
		  ters.

		bsr_unicode:
		  Specifies  specifically  that	\R is to match all the Unicode
		  newline characters (including	CRLF, and so on, the default).

		no_start_optimize:
		  Disables  optimization  that	can  malfunction  if  "Special
		  start-of-pattern  items"  are	present	in the regular expres-
		  sion.	A typical example  would  be  when  matching  "DEFABC"
		  against "(*COMMIT)ABC", where	the start optimization of PCRE
		  would	 skip the subject up to	"A" and	never realize that the
		  (*COMMIT) instruction	is to have  made  the  matching	 fail.
		  This	option	is  only relevant if you use "start-of-pattern
		  items", as discussed in section PCRE Regular Expression  De-
		  tails.

		ucp:
		  Specifies  that  Unicode character properties	are to be used
		  when resolving \B, \b, \D, \d, \S, \s, \W  and  \w.  Without
		  this	flag, only ISO Latin-1 properties are used. Using Uni-
		  code properties hurts	performance, but is semantically  cor-
		  rect	when  working  with  Unicode characters	beyond the ISO
		  Latin-1 range.

		never_utf:
		  Specifies that the (*UTF) and/or  (*UTF8)  "start-of-pattern
		  items"  are forbidden. This flag cannot be combined with op-
		  tion unicode.	Useful if ISO Latin-1 patterns from an	exter-
		  nal source are to be compiled.

       inspect(MP, Item) -> {namelist, [binary()]}

	      Types:

		 MP = mp()
		 Item =	namelist

	      Takes a compiled regular expression and an item, and returns the
	      relevant	data  from  the	regular	expression. The	only supported
	      item is  namelist,  which	 returns  the  tuple  {namelist,  [bi-
	      nary()]},	containing the names of	all (unique) named subpatterns
	      in the regular expression. For example:

	      1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
	      {ok,{re_pattern,3,0,0,
			      <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
				255,255,...>>}}
	      2> re:inspect(MP,namelist).
	      {namelist,[<<"A">>,<<"B">>,<<"C">>]}
	      3> {ok,MPD} = re:compile("(?<C>A)|(?<B>B)|(?<C>C)",[dupnames]).
	      {ok,{re_pattern,3,0,0,
			      <<69,82,67,80,119,0,0,0,0,0,8,0,1,0,0,0,255,255,255,255,
				255,255,...>>}}
	      4> re:inspect(MPD,namelist).
	      {namelist,[<<"B">>,<<"C">>]}

	      Notice in	the second example that	the duplicate name only	occurs
	      once  in the returned list, and that the list is in alphabetical
	      order regardless of where	the names are positioned in the	 regu-
	      lar  expression. The order of the	names is the same as the order
	      of captured subexpressions if {capture, all_names} is  specified
	      as  an option to run/3. You can therefore	create a name-to-value
	      mapping from the result of run/3 like this:

	      1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
	      {ok,{re_pattern,3,0,0,
			      <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
				255,255,...>>}}
	      2> {namelist, N} = re:inspect(MP,namelist).
	      {namelist,[<<"A">>,<<"B">>,<<"C">>]}
	      3> {match,L} = re:run("AA",MP,[{capture,all_names,binary}]).
	      {match,[<<"A">>,<<>>,<<>>]}
	      4> NameMap = lists:zip(N,L).
	      [{<<"A">>,<<"A">>},{<<"B">>,<<>>},{<<"C">>,<<>>}]

       replace(Subject,	RE, Replacement) -> iodata() | unicode:charlist()

	      Types:

		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata()
		 Replacement = iodata()	| unicode:charlist() | replace_fun()

	      Same as replace(Subject, RE, Replacement,	[]).

       replace(Subject,	RE, Replacement, Options) ->
		  iodata() | unicode:charlist()

	      Types:

		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata() |	unicode:charlist()
		 Replacement = iodata()	| unicode:charlist() | replace_fun()
		 Options = [Option]
		 Option	=
		     anchored |	global | notbol	| noteol | notempty |
		     notempty_atstart |
		     {offset, integer()	>= 0} |
		     {newline, NLSpec} |
		     bsr_anycrlf |
		     {match_limit, integer() >=	0} |
		     {match_limit_recursion, integer() >= 0} |
		     bsr_unicode |
		     {return, ReturnType} |
		     CompileOpt
		 ReturnType = iodata | list | binary
		 CompileOpt = compile_option()
		 NLSpec	= cr | crlf | lf | anycrlf | any

	      Replaces the matched part	of the Subject	string	with  Replace-
	      ment.

	      The  permissible	options	are the	same as	for run/3, except that
	      option capture is	not allowed. Instead a {return,	ReturnType} is
	      present. The default return type is iodata, constructed in a way
	      to minimize copying. The iodata result can be used  directly  in
	      many  I/O	 operations. If	a flat list() is desired, specify {re-
	      turn, list}. If a	binary is desired, specify {return, binary}.

	      As in function run/3, an mp() compiled with option  unicode  re-
	      quires  Subject  to  be  a Unicode charlist(). If	compilation is
	      done implicitly and the unicode compilation option is  specified
	      to this function,	both the regular expression and	Subject	are to
	      specified	as valid Unicode charlist()s.

	      If the replacement is given as a string, it can contain the spe-
	      cial character &,	which inserts the whole	matching expression in
	      the result, and the special sequence \N (where N is an integer >
	      0),  \gN,	 or \g{N}, resulting in	the subexpression number N, is
	      inserted in the result. If no subexpression with that number  is
	      generated	by the regular expression, nothing is inserted.

	      To insert	an & or	a \ in the result, precede it with a \.	Notice
	      that  Erlang  already  gives  a  special meaning to \ in literal
	      strings, so a single \ must be written as	"\\" and  therefore  a
	      double \ as "\\\\".

	      Example:

	      re:replace("abcd","c","[&]",[{return,list}]).

	      gives

	      "ab[c]d"

	      while

	      re:replace("abcd","c","[\\&]",[{return,list}]).

	      gives

	      "ab[&]d"

	      If the replacement is given as a fun, it will be called with the
	      whole  matching  expression  as the first	argument and a list of
	      subexpression matches in the order in which they appear  in  the
	      regular  expression.  The	returned value will be inserted	in the
	      result.

	      Example:

	      re:replace("abcd", ".(.)", fun(Whole, [<<C>>]) ->	<<$#, Whole/binary, $-,	(C - $a	+ $A), $#>> end, [{return, list}]).

	      gives

	      "#ab-B#cd"

	  Note:
	      Non-matching optional subexpressions will	not be included	in the
	      list of subexpression matches if they are	 the  last  subexpres-
	      sions in the regular expression.

	      Example:

	      The  regular  expression "(a)(b)?(c)?" ("a", optionally followed
	      by "b", optionally followed by "c") will	create	the  following
	      subexpression lists:

		* [<<"a">>, <<"b">>, <<"c">>] when applied to the string "abc"

		* [<<"a">>, <<>>, <<"c">>] when	applied	to the string "acx"

		* [<<"a">>, <<"b">>] when applied to the string	"abx"

		* [<<"a">>] when applied to the	string "axx"

	      As  with	run/3,	compilation errors raise the badarg exception.
	      compile/2	can be used to get more	information about the error.

       run(Subject, RE)	-> {match, Captured} | nomatch

	      Types:

		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata()
		 Captured = [CaptureData]
		 CaptureData = {integer(), integer()}

	      Same as run(Subject,RE,[]).

       run(Subject, RE,	Options) ->
	      {match, Captured}	| match	| nomatch | {error, ErrType}

	      Types:

		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata() |	unicode:charlist()
		 Options = [Option]
		 Option	=
		     anchored |	global | notbol	| noteol | notempty |
		     notempty_atstart |	report_errors |
		     {offset, integer()	>= 0} |
		     {match_limit, integer() >=	0} |
		     {match_limit_recursion, integer() >= 0} |
		     {newline, NLSpec :: nl_spec()} |
		     bsr_anycrlf | bsr_unicode |
		     {capture, ValueSpec} |
		     {capture, ValueSpec, Type}	|
		     CompileOpt
		 Type =	index |	list | binary
		 ValueSpec =
		     all | all_but_first | all_names | first  |	 none  |  Val-
		 ueList
		 ValueList = [ValueID]
		 ValueID = integer() | string()	| atom()
		 CompileOpt = compile_option()
		   See compile/2.
		 Captured = [CaptureData] | [[CaptureData]]
		 CaptureData =
		     {integer(), integer()} | ListConversionData | binary()
		 ListConversionData =
		     string() |
		     {error, string(), binary()} |
		     {incomplete, string(), binary()}
		 ErrType =
		     match_limit  |  match_limit_recursion  |  {compile,  Com-
		 pileErr}
		 CompileErr =
		     {ErrString	:: string(), Position :: integer() >= 0}

	      Executes	 a   regular   expression   matching,	and    returns
	      match/{match,  Captured}	or nomatch. The	regular	expression can
	      be specified either as iodata() in which case  it	 is  automati-
	      cally  compiled  (as by compile/2) and executed, or as a precom-
	      piled mp() in which case it is executed against the subject  di-
	      rectly.

	      When  compilation	 is  involved, exception badarg	is thrown if a
	      compilation error	occurs.	 Call  compile/2  to  get  information
	      about the	location of the	error in the regular expression.

	      If  the  regular	expression  is previously compiled, the	option
	      list can only contain the	following options:

		* anchored

		* {capture, ValueSpec}/{capture, ValueSpec, Type}

		* global

		* {match_limit,	integer() >= 0}

		* {match_limit_recursion, integer() >= 0}

		* {newline, NLSpec}

		* notbol

		* notempty

		* notempty_atstart

		* noteol

		* {offset, integer() >=	0}

		* report_errors

	      Otherwise	all options valid for function compile/2 are also  al-
	      lowed.  Options  allowed both for	compilation and	execution of a
	      match, namely anchored and {newline, NLSpec},  affect  both  the
	      compilation and execution	if present together with a non-precom-
	      piled regular expression.

	      If  the  regular	expression was previously compiled with	option
	      unicode,	Subject	 is  to	 be  provided  as  a   valid   Unicode
	      charlist(),  otherwise  any  iodata() will do. If	compilation is
	      involved and option unicode is specified,	both Subject  and  the
	      regular	expression  are	 to  be	 specified  as	valid  Unicode
	      charlists().

	      {capture,	ValueSpec}/{capture, ValueSpec,	Type} defines what  to
	      return  from  the	function upon successful matching. The capture
	      tuple can	contain	both a value specification, telling  which  of
	      the  captured substrings are to be returned, and a type specifi-
	      cation, telling how captured substrings are to be	 returned  (as
	      index  tuples, lists, or binaries). The options are described in
	      detail below.

	      If the capture options describe that no substring	 capturing  is
	      to  be  done  ({capture, none}), the function returns the	single
	      atom match upon successful matching, otherwise the tuple {match,
	      ValueList}. Disabling capturing can be done either by specifying
	      none or an empty list as ValueSpec.

	      Option report_errors adds	the possibility	that an	error tuple is
	      returned.	 The  tuple  either   indicates	  a   matching	 error
	      (match_limit  or match_limit_recursion), or a compilation	error,
	      where the	error tuple has	 the  format  {error,  {compile,  Com-
	      pileErr}}. Notice	that if	option report_errors is	not specified,
	      the function never returns error tuples, but reports compilation
	      errors  as  a badarg exception and failed	matches	because	of ex-
	      ceeded match limits simply as nomatch.

	      The following options are	relevant for execution:

		anchored:
		  Limits run/3 to matching at the first	matching position.  If
		  a  pattern  was  compiled with anchored, or turned out to be
		  anchored by virtue of	its contents, it cannot	be made	 unan-
		  chored  at  matching	time, hence there is no	unanchored op-
		  tion.

		global:
		  Implements global (repetitive) search	(flag g	in Perl). Each
		  match	is returned as a separate list() containing  the  spe-
		  cific	match and any matching subexpressions (or as specified
		  by  option capture. The Captured part	of the return value is
		  hence	a list() of list()s when this option is	specified.

		  The interaction of option global with	a  regular  expression
		  that	matches	an empty string	surprises some users. When op-
		  tion global is specified, run/3 handles empty	matches	in the
		  same way as Perl: a zero-length match	at any point  is  also
		  retried  with	 options [anchored, notempty_atstart]. If that
		  search gives a result	of length > 0, the result is included.
		  Example:

		re:run("cat","(|at)",[global]).

		  The following	matchings are performed:

		  At offset 0:
		    The	regular	expression (|at) first match  at  the  initial
		    position   of   string   cat,   giving   the   result  set
		    [{0,0},{0,0}] (the second {0,0} is because of  the	subex-
		    pression  marked by	the parentheses). As the length	of the
		    match is 0,	we do not advance to the next position yet.

		  At offset 0 with [anchored, notempty_atstart]:
		    The	search is retried with options [anchored, notempty_at-
		    start] at the same position, which does not	give  any  in-
		    teresting  result of longer	length,	so the search position
		    is advanced	to the next character (a).

		  At offset 1:
		    The	search results in [{1,0},{1,0}],  so  this  search  is
		    also repeated with the extra options.

		  At offset 1 with [anchored, notempty_atstart]:
		    Alternative	 ab  is	found and the result is	[{1,2},{1,2}].
		    The	result is added	to the list of results and  the	 posi-
		    tion in the	search string is advanced two steps.

		  At offset 3:
		    The	 search	 once  again  matches the empty	string,	giving
		    [{3,0},{3,0}].

		  At offset 1 with [anchored, notempty_atstart]:
		    This gives no result of length > 0 and we are at the  last
		    position, so the global search is complete.

		  The result of	the call is:

		{match,[[{0,0},{0,0}],[{1,0},{1,0}],[{1,2},{1,2}],[{3,0},{3,0}]]}

		notempty:
		  An  empty  string  is	 not considered	to be a	valid match if
		  this option is specified. If alternatives in the pattern ex-
		  ist, they are	tried. If all the alternatives match the empty
		  string, the entire match fails.

		  Example:

		  If the following pattern is applied to a string  not	begin-
		  ning	with  "a"  or  "b",  it	would normally match the empty
		  string at the	start of the subject:

		a?b?

		  With option  notempty,  this	match  is  invalid,  so	 run/3
		  searches  further  into the string for occurrences of	"a" or
		  "b".

		notempty_atstart:
		  Like notempty, except	that an	empty string match that	is not
		  at the start of the subject is permitted. If the pattern  is
		  anchored,  such  a  match can	occur only if the pattern con-
		  tains	\K.

		  Perl has no direct equivalent	of  notempty  or  notempty_at-
		  start, but it	does make a special case of a pattern match of
		  the empty string within its split() function,	and when using
		  modifier  /g.	The Perl behavior can be emulated after	match-
		  ing a	null string by first trying the	 match	again  at  the
		  same offset with notempty_atstart and	anchored, and then, if
		  that fails, by advancing the starting	offset (see below) and
		  trying an ordinary match again.

		notbol:
		  Specifies  that the first character of the subject string is
		  not the beginning of a line, so the circumflex metacharacter
		  is not to match before it. Setting  this  without  multiline
		  (at compile time) causes circumflex never to match. This op-
		  tion only affects the	behavior of the	circumflex metacharac-
		  ter. It does not affect \A.

		noteol:
		  Specifies  that the end of the subject string	is not the end
		  of a line, so	the dollar metacharacter is not	 to  match  it
		  nor  (except in multiline mode) a newline immediately	before
		  it. Setting this without multiline (at compile time)	causes
		  dollar never to match. This option affects only the behavior
		  of the dollar	metacharacter. It does not affect \Z or	\z.

		report_errors:
		  Gives	 better	 control  of the error handling	in run/3. When
		  specified, compilation errors	(if the	regular	expression  is
		  not  already compiled) and runtime errors are	explicitly re-
		  turned as an error tuple.

		  The following	are the	possible runtime errors:

		  match_limit:
		    The	PCRE library sets a limit on how many times the	inter-
		    nal	match function can be called. Defaults	to  10,000,000
		    in	 the   library	 compiled   for	  Erlang.  If  {error,
		    match_limit} is returned, the execution of the regular ex-
		    pression has reached this limit. This is  normally	to  be
		    regarded  as  a nomatch, which is the default return value
		    when this occurs, but by specifying	report_errors, you are
		    informed when the match fails because of too many internal
		    calls.

		  match_limit_recursion:
		    This error is very similar to match_limit, but occurs when
		    the	internal  match	 function  of  PCRE  is	 "recursively"
		    called  more  times	 than the match_limit_recursion	limit,
		    which defaults to 10,000,000 as well. Notice that as  long
		    as the match_limit and match_limit_default values are kept
		    at	the  default  values,  the match_limit_recursion error
		    cannot occur, as the match_limit error occurs before  that
		    (each  recursive call is also a call, but not conversely).
		    Both limits	can however be changed,	either by setting lim-
		    its	directly in the	regular	expression string (see section
		    PCRE Regular Eexpression Details) or by specifying options
		    to run/3.

		  It is	important to understand	that what is  referred	to  as
		  "recursion"  when limiting matches is	not recursion on the C
		  stack	of the Erlang machine or on the	Erlang process	stack.
		  The  PCRE  version  compiled into the	Erlang VM uses machine
		  "heap" memory	to store values	that must be kept over	recur-
		  sion in regular expression matches.

		{match_limit, integer()	>= 0}:
		  Limits  the  execution time of a match in an implementation-
		  specific way.	It is described	as follows by the  PCRE	 docu-
		  mentation:

		The match_limit	field provides a means of preventing PCRE from using
		up a vast amount of resources when running patterns that are not going
		to match, but which have a very	large number of	possibilities in their
		search trees. The classic example is a pattern that uses nested
		unlimited repeats.

		Internally, pcre_exec()	uses a function	called match(),	which it calls
		repeatedly (sometimes recursively). The	limit set by match_limit is
		imposed	on the number of times this function is	called during a	match,
		which has the effect of	limiting the amount of backtracking that can
		take place. For	patterns that are not anchored,	the count restarts
		from zero for each position in the subject string.

		  This	means that runaway regular expression matches can fail
		  faster if the	limit is lowered using this  option.  The  de-
		  fault	value 10,000,000 is compiled into the Erlang VM.

	    Note:
		This  option does in no	way affect the execution of the	Erlang
		VM in terms of "long running BIFs". run/3 always gives control
		back to	the scheduler of Erlang	processes  at  intervals  that
		ensures	the real-time properties of the	Erlang system.

		{match_limit_recursion,	integer() >= 0}:
		  Limits  the execution	time and memory	consumption of a match
		  in  an  implementation-specific   way,   very	  similar   to
		  match_limit. It is described as follows by the PCRE documen-
		  tation:

		The match_limit_recursion field	is similar to match_limit, but instead
		of limiting the	total number of	times that match() is called, it
		limits the depth of recursion. The recursion depth is a	smaller	number
		than the total number of calls,	because	not all	calls to match() are
		recursive. This	limit is of use	only if	it is set smaller than
		match_limit.

		Limiting the recursion depth limits the	amount of machine stack	that
		can be used, or, when PCRE has been compiled to	use memory on the heap
		instead	of the stack, the amount of heap memory	that can be used.

		  The  Erlang VM uses a	PCRE library where heap	memory is used
		  when regular expression match	recursion occurs. This	there-
		  fore limits the use of machine heap, not C stack.

		  Specifying a lower value can result in matches with deep re-
		  cursion failing, when	they should have matched:

		1> re:run("aaaaaaaaaaaaaz","(a+)*z").
		{match,[{0,14},{0,13}]}
		2> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5}]).
		nomatch
		3> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5},report_errors]).
		{error,match_limit_recursion}

		  This	option	and  option match_limit	are only to be used in
		  rare cases. Understanding of the PCRE	library	 internals  is
		  recommended before tampering with these limits.

		{offset, integer() >= 0}:
		  Start	 matching  at  the  offset (position) specified	in the
		  subject string. The offset is	zero-based, so	that  the  de-
		  fault	is {offset,0} (all of the subject string).

		{newline, NLSpec}:
		  Overrides the	default	definition of a	newline	in the subject
		  string, which	is LF (ASCII 10) in Erlang.

		  cr:
		    Newline is indicated by a single character CR (ASCII 13).

		  lf:
		    Newline  is	indicated by a single character	LF (ASCII 10),
		    the	default.

		  crlf:
		    Newline is indicated by the	two-character CRLF  (ASCII  13
		    followed by	ASCII 10) sequence.

		  anycrlf:
		    Any	of the three preceding sequences is be recognized.

		  any:
		    Any	 of  the  newline sequences above, and the Unicode se-
		    quences VT (vertical tab, U+000B), FF (formfeed,  U+000C),
		    NEL	 (next line, U+0085), LS (line separator, U+2028), and
		    PS (paragraph separator, U+2029).

		bsr_anycrlf:
		  Specifies specifically that \R is to match only the  CR  LF,
		  or  CRLF sequences, not the Unicode-specific newline charac-
		  ters.	(Overrides the compilation option.)

		bsr_unicode:
		  Specifies specifically that \R is to match all  the  Unicode
		  newline characters (including	CRLF, and so on, the default).
		  (Overrides the compilation option.)

		{capture, ValueSpec}/{capture, ValueSpec, Type}:
		  Specifies which captured substrings are returned and in what
		  format.  By default, run/3 captures all of the matching part
		  of the substring and all capturing subpatterns (all  of  the
		  pattern  is automatically captured). The default return type
		  is (zero-based) indexes of the captured parts	of the string,
		  specified as {Offset,Length} pairs (the index	Type  of  cap-
		  turing).

		  As  an  example  of the default behavior, the	following call
		  returns, as first and	only  captured	string,	 the  matching
		  part	of the subject ("abcd" in the middle) as an index pair
		  {3,4}, where character positions are zero-based, just	as  in
		  offsets:

		re:run("ABCabcdABC","abcd",[]).

		  The return value of this call	is:

		{match,[{3,4}]}

		  Another (and quite common) case is where the regular expres-
		  sion matches all of the subject:

		re:run("ABCabcdABC",".*abcd.*",[]).

		  Here	the return value correspondingly points	out all	of the
		  string, beginning at index 0,	and it is 10 characters	long:

		{match,[{0,10}]}

		  If the regular expression  contains  capturing  subpatterns,
		  like in:

		re:run("ABCabcdABC",".*(abcd).*",[]).

		  all  of the matched subject is captured, as well as the cap-
		  tured	substrings:

		{match,[{0,10},{3,4}]}

		  The complete matching	pattern	always gives the first	return
		  value	in the list and	the remaining subpatterns are added in
		  the order they occurred in the regular expression.

		  The capture tuple is built up	as follows:

		  ValueSpec:
		    Specifies which captured (sub)patterns are to be returned.
		    ValueSpec  can  either  be an atom describing a predefined
		    set	of return values, or a list containing the indexes  or
		    the	names of specific subpatterns to return.

		    The	following are the predefined sets of subpatterns:

		    all:
		      All captured subpatterns including the complete matching
		      string. This is the default.

		    all_names:
		      All named	subpatterns in the regular expression, as if a
		      list() of	all the	names in alphabetical order was	speci-
		      fied.  The  list of all names can	also be	retrieved with
		      inspect/2.

		    first:
		      Only the first captured subpattern, which	is always  the
		      complete	matching  part	of the subject.	All explicitly
		      captured subpatterns are discarded.

		    all_but_first:
		      All but the first	matching subpattern, that is, all  ex-
		      plicitly	captured  subpatterns,	but  not  the complete
		      matching part of the subject string. This	is  useful  if
		      the  regular  expression as a whole matches a large part
		      of the subject, but the part you are interested in is in
		      an explicitly captured subpattern. If the	return type is
		      list or binary, not returning subpatterns	 you  are  not
		      interested in is a good way to optimize.

		    none:
		      Returns  no  matching subpatterns, gives the single atom
		      match as the return value	of the function	when  matching
		      successfully  instead  of	 the  {match,  list()} return.
		      Specifying an empty list gives the same behavior.

		    The	value list is a	list of	indexes	for the	subpatterns to
		    return, where index	0 is for all of	the pattern, and 1  is
		    for	the first explicit capturing subpattern	in the regular
		    expression,	 and  so on. When using	named captured subpat-
		    terns (see below) in the regular expression, one  can  use
		    atom()s  or	string()s to specify the subpatterns to	be re-
		    turned. For	example, consider the regular expression:

		  ".*(abcd).*"

		    matched against string "ABCabcdABC",  capturing  only  the
		    "abcd" part	(the first explicit subpattern):

		  re:run("ABCabcdABC",".*(abcd).*",[{capture,[1]}]).

		    The	 call gives the	following result, as the first explic-
		    itly captured subpattern is	"(abcd)", matching  "abcd"  in
		    the	subject, at (zero-based) position 3, of	length 4:

		  {match,[{3,4}]}

		    Consider the same regular expression, but with the subpat-
		    tern explicitly named 'FOO':

		  ".*(?<FOO>abcd).*"

		    With this expression, we could still give the index	of the
		    subpattern with the	following call:

		  re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,[1]}]).

		    giving  the	 same result as	before.	But, as	the subpattern
		    is named, we can also specify its name in the value	list:

		  re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,['FOO']}]).

		    This would give the	same result as the  earlier  examples,
		    namely:

		  {match,[{3,4}]}

		    The	 values	 list can specify indexes or names not present
		    in the regular expression, in which	case the return	values
		    vary depending on the type.	If the type is index, the  tu-
		    ple	 {-1,0}	 is  returned for values with no corresponding
		    subpattern in the regular expression, but  for  the	 other
		    types  (binary  and	list), the values are the empty	binary
		    or list, respectively.

		  Type:
		    Optionally specifies how captured substrings are to	be re-
		    turned. If omitted,	the default of index is	used.

		    Type can be	one of the following:

		    index:
		      Returns captured substrings as  pairs  of	 byte  indexes
		      into  the	 subject  string  and  length  of the matching
		      string in	the subject (as	 if  the  subject  string  was
		      flattened	  with	 erlang:iolist_to_binary/1   or	  uni-
		      code:characters_to_binary/2  before  matching).	Notice
		      that  option unicode results in byte-oriented indexes in
		      a	(possibly virtual) UTF-8 encoded binary. A byte	 index
		      tuple  {0,2}  can	therefore represent one	or two charac-
		      ters when	unicode	is in effect. This can	seem  counter-
		      intuitive,  but  has  been deemed	the most effective and
		      useful way to do it. To return lists instead can	result
		      in  simpler code if that is desired. This	return type is
		      the default.

		    list:
		      Returns matching substrings as lists of characters  (Er-
		      lang  string()s).	 It option unicode is used in combina-
		      tion with	the \C sequence	in the regular	expression,  a
		      captured subpattern can contain bytes that are not valid
		      UTF-8  (\C  matches bytes	regardless of character	encod-
		      ing). In that case the list capturing can	result in  the
		      same  types  of tuples that unicode:characters_to_list/2
		      can return, namely three-tuples with tag	incomplete  or
		      error, the successfully converted	characters and the in-
		      valid UTF-8 tail of the conversion as a binary. The best
		      strategy	is to avoid using the \C sequence when captur-
		      ing lists.

		    binary:
		      Returns matching substrings as binaries. If option  uni-
		      code is used, these binaries are in UTF-8. If the	\C se-
		      quence  is  used together	with unicode, the binaries can
		      be invalid UTF-8.

		  In general, subpatterns that were not	assigned  a  value  in
		  the  match are returned as the tuple {-1,0} when type	is in-
		  dex. Unassigned subpatterns are returned as the empty	binary
		  or list, respectively, for other return types. Consider  the
		  following regular expression:

		".*((?<FOO>abdd)|a(..d)).*"

		  There	 are three explicitly capturing	subpatterns, where the
		  opening parenthesis position determines the order in the re-
		  sult,	hence ((?<FOO>abdd)|a(..d))  is	 subpattern  index  1,
		  (?<FOO>abdd)	is subpattern index 2, and (..d) is subpattern
		  index	3. When	matched	against	the following string:

		"ABCabcdABC"

		  the subpattern at index 2 does not match, as "abdd"  is  not
		  present in the string, but the complete pattern matches (be-
		  cause	 of the	alternative a(..d)). The subpattern at index 2
		  is therefore unassigned and the default return value is:

		{match,[{0,10},{3,4},{-1,0},{4,3}]}

		  Setting the capture Type to binary gives:

		{match,[<<"ABCabcdABC">>,<<"abcd">>,<<>>,<<"bcd">>]}

		  Here the empty binary	(<<>>) represents the unassigned  sub-
		  pattern.  In	the  binary  case,  some information about the
		  matching is therefore	lost, as <<>> can  also	 be  an	 empty
		  string captured.

		  If  differentiation  between	empty matches and non-existing
		  subpatterns is necessary, use	the type index and do the con-
		  version to the final type in Erlang code.

		  When option global is	speciified, the	capture	 specification
		  affects each match separately, so that:

		re:run("cacb","c(a|b)",[global,{capture,[1],list}]).

		  gives

		{match,[["a"],["b"]]}

	      For  a  descriptions  of	options	only affecting the compilation
	      step, see	compile/2.

       split(Subject, RE) -> SplitList

	      Types:

		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata()
		 SplitList = [iodata() | unicode:charlist()]

	      Same as split(Subject, RE, []).

       split(Subject, RE, Options) -> SplitList

	      Types:

		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata() |	unicode:charlist()
		 Options = [Option]
		 Option	=
		     anchored |	notbol | noteol	| notempty |  notempty_atstart
		 |
		     {offset, integer()	>= 0} |
		     {newline, nl_spec()} |
		     {match_limit, integer() >=	0} |
		     {match_limit_recursion, integer() >= 0} |
		     bsr_anycrlf | bsr_unicode |
		     {return, ReturnType} |
		     {parts, NumParts} |
		     group | trim | CompileOpt
		 NumParts = integer() >= 0 | infinity
		 ReturnType = iodata | list | binary
		 CompileOpt = compile_option()
		   See compile/2.
		 SplitList = [RetData] | [GroupedRetData]
		 GroupedRetData	= [RetData]
		 RetData = iodata() | unicode:charlist() | binary() | list()

	      Splits  the  input into parts by finding tokens according	to the
	      regular expression supplied. The splitting is basically done  by
	      running  a global	regular	expression match and dividing the ini-
	      tial string wherever a match occurs. The matching	 part  of  the
	      string is	removed	from the output.

	      As  in run/3, an mp() compiled with option unicode requires Sub-
	      ject to be a Unicode charlist(). If compilation is done  implic-
	      itly  and	 the  unicode  compilation option is specified to this
	      function,	both the regular expression  and  Subject  are	to  be
	      specified	as valid Unicode charlist()s.

	      The  result  is given as a list of "strings", the	preferred data
	      type specified in	option return (default iodata).

	      If subexpressions	are specified in the regular  expression,  the
	      matching	subexpressions	are  returned in the resulting list as
	      well. For	example:

	      re:split("Erlang","[ln]",[{return,list}]).

	      gives

	      ["Er","a","g"]

	      while

	      re:split("Erlang","([ln])",[{return,list}]).

	      gives

	      ["Er","l","a","n","g"]

	      The text matching	the subexpression (marked by  the  parentheses
	      in  the regular expression) is inserted in the result list where
	      it was found. This means that  concatenating  the	 result	 of  a
	      split  where the whole regular expression	is a single subexpres-
	      sion (as in the last example) always  results  in	 the  original
	      string.

	      As  there	 is no matching	subexpression for the last part	in the
	      example (the "g"), nothing is inserted after that. To  make  the
	      group  of	strings	and the	parts matching the subexpressions more
	      obvious, one can use option group,  which	 groups	 together  the
	      part  of	the  subject string with the parts matching the	subex-
	      pressions	when the string	was split:

	      re:split("Erlang","([ln])",[{return,list},group]).

	      gives

	      [["Er","l"],["a","n"],["g"]]

	      Here the regular expression first	matched	the "l", causing  "Er"
	      to  be the first part in the result. When	the regular expression
	      matched, the (only) subexpression	was bound to the "l",  so  the
	      "l"  is inserted in the group together with "Er".	The next match
	      is of the	"n", making "a"	the next part to be returned.  As  the
	      subexpression is bound to	substring "n" in this case, the	"n" is
	      inserted into this group.	The last group consists	of the remain-
	      ing string, as no	more matches are found.

	      By  default,  all	 parts	of  the	 string,  including  the empty
	      strings, are returned from the function, for example:

	      re:split("Erlang","[lg]",[{return,list}]).

	      gives

	      ["Er","an",[]]

	      as the matching of the "g" in the	end of the  string  leaves  an
	      empty  rest,  which is also returned. This behavior differs from
	      the default behavior of the split	function in Perl, where	 empty
	      strings at the end are by	default	removed. To get	the "trimming"
	      default behavior of Perl,	specify	trim as	an option:

	      re:split("Erlang","[lg]",[{return,list},trim]).

	      gives

	      ["Er","an"]

	      The  "trim"  option says;	"give me as many parts as possible ex-
	      cept the empty ones", which sometimes can	 be  useful.  You  can
	      also specify how many parts you want, by specifying {parts,N}:

	      re:split("Erlang","[lg]",[{return,list},{parts,2}]).

	      gives

	      ["Er","ang"]

	      Notice  that  the	last part is "ang", not	"an", as splitting was
	      specified	into two parts,	and the	splitting  stops  when	enough
	      parts  are  given,  which	is why the result differs from that of
	      trim.

	      More than	three parts are	not possible with this indata, so

	      re:split("Erlang","[lg]",[{return,list},{parts,4}]).

	      gives the	same result as the default, which is to	be  viewed  as
	      "an infinite number of parts".

	      Specifying 0 as the number of parts gives	the same effect	as op-
	      tion  trim. If subexpressions are	captured, empty	subexpressions
	      matched at the end are also stripped from	the result if trim  or
	      {parts,0}	is specified.

	      The  trim	 behavior  corresponds	exactly	 to  the Perl default.
	      {parts,N}, where N is a positive integer,	corresponds exactly to
	      the Perl behavior	with a positive	numerical third	parameter. The
	      default behavior of split/3 corresponds  to  the	Perl  behavior
	      when  a negative integer is specified as the third parameter for
	      the Perl routine.

	      Summary of options not previously	described for function run/3:

		{return,ReturnType}:
		  Specifies how	the parts of the original string are presented
		  in the result	list. Valid types:

		  iodata:
		    The	variant	of iodata() that gives the  least  copying  of
		    data  with the current implementation (often a binary, but
		    do not depend on it).

		  binary:
		    All	parts returned as binaries.

		  list:
		    All	parts returned as lists	of characters ("strings").

		group:
		  Groups together the part of the string with the parts	of the
		  string matching the subexpressions of	 the  regular  expres-
		  sion.

		  The  return value from the function is in this case a	list()
		  of list()s. Each sublist begins with the string  picked  out
		  of  the  subject string, followed by the parts matching each
		  of the subexpressions	in order of occurrence in the  regular
		  expression.

		{parts,N}:
		  Specifies  the  number  of parts the subject string is to be
		  split	into.

		  The number of	parts is to be a positive integer for  a  spe-
		  cific	 maximum number	of parts, and infinity for the maximum
		  number of parts possible (the	default). Specifying {parts,0}
		  gives	as many	parts as possible disregarding empty parts  at
		  the end, the same as specifying trim.

		trim:
		  Specifies that empty parts at	the end	of the result list are
		  to  be  disregarded.	The same as specifying {parts,0}. This
		  corresponds to the default behavior of  the  split  built-in
		  function in Perl.

PERL-LIKE REGULAR EXPRESSION SYNTAX
       The  following  sections	contain	reference material for the regular ex-
       pressions used by this module. The information is  based	 on  the  PCRE
       documentation,  with  changes  where this module	behaves	differently to
       the PCRE	library.

PCRE REGULAR EXPRESSION	DETAILS
       The syntax and semantics	of the regular expressions supported  by  PCRE
       are  described  in detail in the	following sections. Perl's regular ex-
       pressions are described in its own documentation, and  regular  expres-
       sions in	general	are covered in many books, some	with copious examples.
       Jeffrey	 Friedl's   "Mastering	 Regular  Expressions",	 published  by
       O'Reilly, covers	regular	expressions in great detail. This  description
       of the PCRE regular expressions is intended as reference	material.

       The reference material is divided into the following sections:

	 * Special Start-of-Pattern Items

	 * Characters and Metacharacters

	 * Backslash

	 * Circumflex and Dollar

	 * Full	Stop (Period, Dot) and \N

	 * Matching a Single Data Unit

	 * Square Brackets and Character Classes

	 * Posix Character Classes

	 * Vertical Bar

	 * Internal Option Setting

	 * Subpatterns

	 * Duplicate Subpattern	Numbers

	 * Named Subpatterns

	 * Repetition

	 * Atomic Grouping and Possessive Quantifiers

	 * Back	References

	 * Assertions

	 * Conditional Subpatterns

	 * Comments

	 * Recursive Patterns

	 * Subpatterns as Subroutines

	 * Oniguruma Subroutine	Syntax

	 * Backtracking	Control

SPECIAL	START-OF-PATTERN ITEMS
       Some options that can be	passed to compile/2 can	also be	set by special
       items at	the start of a pattern.	These are not Perl-compatible, but are
       provided	 to  make  these options accessible to pattern writers who are
       not able	to change the program that processes the pattern.  Any	number
       of  these  items	can appear, but	they must all be together right	at the
       start of	the pattern string, and	the letters must be in upper case.

       UTF Support

       Unicode support is basically UTF-8 based. To  use  Unicode  characters,
       you  either call	compile/2 or run/3 with	option unicode,	or the pattern
       must start with one of these special sequences:

       (*UTF8)
       (*UTF)

       Both options give the same effect, the input string is  interpreted  as
       UTF-8. Notice that with these instructions, the automatic conversion of
       lists  to  UTF-8	is not performed by the	re functions. Therefore, using
       these sequences is not recommended. Add	option	unicode	 when  running
       compile/2 instead.

       Some applications that allow their users	to supply patterns can wish to
       restrict	them to	non-UTF	data for security reasons. If option never_utf
       is  set	at compile time, (*UTF), and so	on, are	not allowed, and their
       appearance causes an error.

       Unicode Property	Support

       The following is	another	special	sequence that can appear at the	 start
       of a pattern:

       (*UCP)

       This  has  the  same  effect as setting option ucp: it causes sequences
       such as \d and \w to use	 Unicode  properties  to  determine  character
       types,  instead of recognizing only characters with codes < 256 through
       a lookup	table.

       Disabling Startup Optimizations

       If a pattern starts with	(*NO_START_OPT), it has	 the  same  effect  as
       setting option no_start_optimize	at compile time.

       Newline Conventions

       PCRE supports five conventions for indicating line breaks in strings: a
       single  CR (carriage return) character, a single	LF (line feed) charac-
       ter, the	two-character sequence CRLF, any of the	three  preceding,  and
       any Unicode newline sequence.

       A newline convention can	also be	specified by starting a	pattern	string
       with one	of the following five sequences:

	 (*CR):
	   Carriage return

	 (*LF):
	   Line	feed

	 (*CRLF):
	   >Carriage return followed by	line feed

	 (*ANYCRLF):
	   Any of the three above

	 (*ANY):
	   All Unicode newline sequences

       These  override the default and the options specified to	compile/2. For
       example,	the following pattern changes the convention to	CR:

       (*CR)a.b

       This pattern matches a\nb, as LF	is no longer a newline.	If  more  than
       one of them is present, the last	one is used.

       The  newline  convention	affects	where the circumflex and dollar	asser-
       tions are true. It also affects the interpretation of the dot metachar-
       acter when dotall is not	set, and the behavior of \N. However, it  does
       not affect what the \R escape sequence matches. By default, this	is any
       Unicode	newline	sequence, for Perl compatibility. However, this	can be
       changed;	see the	description of \R  in  section	Newline	 Sequences.  A
       change  of  the \R setting can be combined with a change	of the newline
       convention.

       Setting Match and Recursion Limits

       The caller of run/3 can set a limit on the number of times the internal
       match() function	is called and on the maximum depth of recursive	calls.
       These facilities	are provided to	catch runaway matches  that  are  pro-
       voked by	patterns with huge matching trees (a typical example is	a pat-
       tern  with nested unlimited repeats) and	to avoid running out of	system
       stack by	too much recursion. When  one  of  these  limits  is  reached,
       pcre_exec()  gives an error return. The limits can also be set by items
       at the start of the pattern of the following forms:

       (*LIMIT_MATCH=d)
       (*LIMIT_RECURSION=d)

       Here d is any number of decimal digits. However,	the value of the  set-
       ting  must  be less than	the value set by the caller of run/3 for it to
       have any	effect.	That is, the pattern writer can	lower the limit	set by
       the programmer, but not raise it. If there is more than one setting  of
       one of these limits, the	lower value is used.

       The  default  value for both the	limits is 10,000,000 in	the Erlang VM.
       Notice that the recursion limit does not	affect the stack depth of  the
       VM,  as	PCRE for Erlang	is compiled in such a way that the match func-
       tion never does recursion on the	C stack.

       Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value  of
       the limits set by the caller, not increase them.

CHARACTERS AND METACHARACTERS
       A  regular  expression  is  a pattern that is matched against a subject
       string from left	to right. Most characters stand	for  themselves	 in  a
       pattern	and  match  the	 corresponding characters in the subject. As a
       trivial example,	the following pattern matches a	portion	of  a  subject
       string that is identical	to itself:

       The quick brown fox

       When  caseless  matching	 is  specified	(option	caseless), letters are
       matched independently of	case.

       The power of regular expressions	comes from the ability to include  al-
       ternatives  and	repetitions  in	 the pattern. These are	encoded	in the
       pattern by the use of metacharacters, which do not stand	for themselves
       but instead are interpreted in some special way.

       Two sets	of metacharacters exist: those that are	recognized anywhere in
       the pattern except within square	brackets, and those  that  are	recog-
       nized  within square brackets. Outside square brackets, the metacharac-
       ters are	as follows:

	 \:
	   General escape character with many uses

	 ^:
	   Assert start	of string (or line, in multiline mode)

	 $:
	   Assert end of string	(or line, in multiline mode)

	 .:
	   Match any character except newline (by default)

	 [:
	   Start character class definition

	 |:
	   Start of alternative	branch

	 (:
	   Start subpattern

	 ):
	   End subpattern

	 ?:
	   Extends the meaning of (, also 0 or 1 quantifier,  also  quantifier
	   minimizer

	 *:
	   0 or	more quantifiers

	 +:
	   1 or	more quantifier, also "possessive quantifier"

	 {:
	   Start min/max quantifier

       Part of a pattern within	square brackets	is called a "character class".
       The following are the only metacharacters in a character	class:

	 \:
	   General escape character

	 ^:
	   Negate the class, but only if the first character

	 -:
	   Indicates character range

	 [:
	   Posix character class (only if followed by Posix syntax)

	 ]:
	   Terminates the character class

       The following sections describe the use of each metacharacter.

BACKSLASH
       The  backslash  character  has many uses. First,	if it is followed by a
       character that is not a number or a letter, it takes away  any  special
       meaning	that  a	character can have. This use of	backslash as an	escape
       character applies both inside and outside character classes.

       For example, if you want	to match a * character,	you write  \*  in  the
       pattern.	 This escaping action applies if the following character would
       otherwise be interpreted	as a metacharacter, so it is  always  safe  to
       precede a non-alphanumeric with backslash to specify that it stands for
       itself. In particular, if you want to match a backslash,	write \\.

       In  unicode mode, only ASCII numbers and	letters	have any special mean-
       ing after a backslash. All other	characters (in particular, those whose
       code points are > 127) are treated as literals.

       If a pattern is compiled	with option extended, whitespace in  the  pat-
       tern  (other than in a character	class) and characters between a	# out-
       side a character	class and the next newline are	ignored.  An  escaping
       backslash can be	used to	include	a whitespace or	# character as part of
       the pattern.

       To  remove  the special meaning from a sequence of characters, put them
       between \Q and \E. This is different from Perl in that $	and @ are han-
       dled as literals	in \Q...\E sequences in	PCRE,  while  $	 and  @	 cause
       variable	interpolation in Perl. Notice the following examples:

       Pattern		  PCRE matches	 Perl matches

       \Qabc$xyz\E	  abc$xyz	 abc followed by the contents of $xyz
       \Qabc\$xyz\E	  abc\$xyz	 abc\$xyz
       \Qabc\E\$\Qxyz\E	  abc$xyz	 abc$xyz

       The  \Q...\E  sequence  is recognized both inside and outside character
       classes.	An isolated \E that is not preceded by \Q is ignored. If \Q is
       not followed by \E later	in the	pattern,  the  literal	interpretation
       continues  to  the  end	of  the	pattern	(that is, \E is	assumed	at the
       end). If	the isolated \Q	is inside a character class,  this  causes  an
       error, as the character class is	not terminated.

       Non-Printing Characters

       A second	use of backslash provides a way	of encoding non-printing char-
       acters  in patterns in a	visible	manner.	There is no restriction	on the
       appearance of non-printing characters, apart from the binary zero  that
       terminates a pattern. When a pattern is prepared	by text	editing, it is
       often  easier to	use one	of the following escape	sequences than the bi-
       nary character it represents:

	 \a:
	   Alarm, that is, the BEL character (hex 07)

	 \cx:
	   "Control-x",	where x	is any ASCII character

	 \e:
	   Escape (hex 1B)

	 \f:
	   Form	feed (hex 0C)

	 \n:
	   Line	feed (hex 0A)

	 \r:
	   Carriage return (hex	0D)

	 \t:
	   Tab (hex 09)

	 \0dd:
	   Character with octal	code 0dd

	 \ddd:
	   Character with octal	code ddd, or back reference

	 \o{ddd..}:
	   character with octal	code ddd..

	 \xhh:
	   Character with hex code hh

	 \x{hhh..}:
	   Character with hex code hhh..

   Note:
       Note that \0dd is always	an octal code, and that	\8 and \9 are the lit-
       eral characters "8" and "9".

       The precise effect of \cx on ASCII characters is	as follows: if x is  a
       lowercase  letter,  it  is  converted  to upper case. Then bit 6	of the
       character (hex 40) is inverted. Thus \cA	to \cZ become hex 01 to	hex 1A
       (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is	7B), and  \c;  becomes
       hex  7B (; is 3B). If the data item (byte or 16-bit value) following \c
       has a value > 127, a compile-time error occurs.	This  locks  out  non-
       ASCII characters	in all modes.

       The  \c	facility  was designed for use with ASCII characters, but with
       the extension to	Unicode	it is even less	useful than it once was.

       After \0	up to two further octal	digits are read. If  there  are	 fewer
       than  two  digits,  just	 those that are	present	are used. Thus the se-
       quence \0\x\015 specifies two binary zeros followed by a	 CR  character
       (code value 13).	Make sure you supply two digits	after the initial zero
       if the pattern character	that follows is	itself an octal	digit.

       The  escape \o must be followed by a sequence of	octal digits, enclosed
       in braces. An error occurs if this is not the case. This	 escape	 is  a
       recent  addition	 to Perl; it provides way of specifying	character code
       points as octal numbers greater than 0777, and  it  also	 allows	 octal
       numbers and back	references to be unambiguously specified.

       For greater clarity and unambiguity, it is best to avoid	following \ by
       a digit greater than zero. Instead, use \o{} or \x{} to specify charac-
       ter  numbers,  and \g{} to specify back references. The following para-
       graphs describe the old,	ambiguous syntax.

       The handling of a backslash followed by a digit other than 0 is compli-
       cated, and Perl has changed in recent releases, causing	PCRE  also  to
       change. Outside a character class, PCRE reads the digit and any follow-
       ing  digits as a	decimal	number.	If the number is < 8, or if there have
       been at least that many previous	capturing left parentheses in the  ex-
       pression,  the entire sequence is taken as a back reference. A descrip-
       tion of how this	works is provided later, following the	discussion  of
       parenthesized subpatterns.

       Inside  a  character class, or if the decimal number following \	is > 7
       and there have not been that many capturing subpatterns,	 PCRE  handles
       \8 and \9 as the	literal	characters "8" and "9",	and otherwise re-reads
       up  to  three  octal  digits following the backslash, and using them to
       generate	a data character. Any subsequent digits	stand for  themselves.
       For example:

	 \040:
	   Another way of writing an ASCII space

	 \40:
	   The same, provided there are	< 40 previous capturing	subpatterns

	 \7:
	   Always a back reference

	 \11:
	   Can be a back reference, or another way of writing a	tab

	 \011:
	   Always a tab

	 \0113:
	   A tab followed by character "3"

	 \113:
	   Can	be  a  back reference, otherwise the character with octal code
	   113

	 \377:
	   Can be a back reference, otherwise value 255	(decimal)

	 \81:
	   Either a back reference, or the two characters "8" and "1"

       Notice that octal values	>= 100 that are	specified  using  this	syntax
       must  not  be introduced	by a leading zero, as no more than three octal
       digits are ever read.

       By default, after \x that is not	followed by {, from zero to two	 hexa-
       decimal	digits	are  read (letters can be in upper or lower case). Any
       number of hexadecimal digits may	appear between \x{ and }. If a charac-
       ter other than a	hexadecimal digit appears between \x{  and  },	or  if
       there is	no terminating }, an error occurs.

       Characters whose	value is less than 256 can be defined by either	of the
       two  syntaxes  for  \x. There is	no difference in the way they are han-
       dled. For example, \xdc is exactly the same as \x{dc}.

       Constraints on character	values

       Characters that are specified using octal or  hexadecimal  numbers  are
       limited to certain values, as follows:

	 8-bit non-UTF mode:
	   < 0x100

	 8-bit UTF-8 mode:
	   < 0x10ffff and a valid codepoint

       Invalid	Unicode	 codepoints  are  the  range 0xd800 to 0xdfff (the so-
       called "surrogate" codepoints), and 0xffef.

       Escape sequences	in character classes

       All the sequences that define a single character	value can be used both
       inside and outside character classes. Also, inside a  character	class,
       \b is interpreted as the	backspace character (hex 08).

       \N  is not allowed in a character class.	\B, \R,	and \X are not special
       inside a	character class. Like  other  unrecognized  escape  sequences,
       they are	treated	as the literal characters "B", "R", and	"X". Outside a
       character class,	these sequences	have different meanings.

       Unsupported Escape Sequences

       In  Perl, the sequences \l, \L, \u, and \U are recognized by its	string
       handler and used	to modify the case of following	characters. PCRE  does
       not support these escape	sequences.

       Absolute	and Relative Back References

       The  sequence  \g followed by an	unsigned or a negative number, option-
       ally enclosed in	braces,	is an absolute or relative back	 reference.  A
       named back reference can	be coded as \g{name}. Back references are dis-
       cussed later, following the discussion of parenthesized subpatterns.

       Absolute	and Relative Subroutine	Calls

       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
       name or a number	enclosed either	in angle brackets or single quotes, is
       alternative syntax for referencing a subpattern as a "subroutine".  De-
       tails  are  discussed  later.  Notice  that  \g{...}  (Perl syntax) and
       \g<...> (Oniguruma syntax) are not synonymous. The  former  is  a  back
       reference and the latter	is a subroutine	call.

       Generic Character Types

       Another use of backslash	is for specifying generic character types:

	 \d:
	   Any decimal digit

	 \D:
	   Any character that is not a decimal digit

	 \h:
	   Any horizontal whitespace character

	 \H:
	   Any character that is not a horizontal whitespace character

	 \s:
	   Any whitespace character

	 \S:
	   Any character that is not a whitespace character

	 \v:
	   Any vertical	whitespace character

	 \V:
	   Any character that is not a vertical	whitespace character

	 \w:
	   Any "word" character

	 \W:
	   Any "non-word" character

       There is	also the single	sequence \N, which matches a non-newline char-
       acter.  This  is	 the  same as the "." metacharacter when dotall	is not
       set. Perl also uses \N to match characters by name, but PCRE  does  not
       support this.

       Each  pair  of  lowercase and uppercase escape sequences	partitions the
       complete	set of characters into two disjoint sets. Any given  character
       matches	one, and only one, of each pair. The sequences can appear both
       inside and outside character classes. They each match one character  of
       the  appropriate	 type.	If the current matching	point is at the	end of
       the subject string, all fail, as	there is no character to match.

       For compatibility with Perl, \s did not used to match the VT  character
       (code  11),  which  made	it different from the the POSIX	"space"	class.
       However,	Perl added VT at release 5.18, and PCRE	followed suit  at  re-
       lease 8.34. The default \s characters are now HT	(9), LF	(10), VT (11),
       FF  (12),  CR (13), and space (32), which are defined as	white space in
       the "C" locale. This list may vary if locale-specific matching is  tak-
       ing  place. For example,	in some	locales	the "non-breaking space" char-
       acter (\xA0) is recognized as white space, and in others	the VT charac-
       ter is not.

       A "word"	character is an	underscore or any character that is  a	letter
       or  a  digit.  By default, the definition of letters and	digits is con-
       trolled by the PCRE low-valued character	tables,	in Erlang's case  (and
       without option unicode),	the ISO	Latin-1	character set.

       By default, in unicode mode, characters with values > 255, that is, all
       characters  outside  the	ISO Latin-1 character set, never match \d, \s,
       or \w, and always match \D, \S, and \W. These  sequences	 retain	 their
       original	meanings from before UTF support was available,	mainly for ef-
       ficiency	 reasons.  However,  if	 option	 ucp  is  set, the behavior is
       changed so that Unicode properties  are	used  to  determine  character
       types, as follows:

	 \d:
	   Any character that \p{Nd} matches (decimal digit)

	 \s:
	   Any character that \p{Z} or \h or \v

	 \w:
	   Any character that matches \p{L} or \p{N} matches, plus underscore

       The uppercase escapes match the inverse sets of characters. Notice that
       \d matches only decimal digits, while \w	matches	any Unicode digit, any
       Unicode letter, and underscore. Notice also that	ucp affects \b and \B,
       as  they	are defined in terms of	\w and \W. Matching these sequences is
       noticeably slower when ucp is set.

       The sequences \h, \H, \v, and \V	are features that were added  to  Perl
       in  release  5.10. In contrast to the other sequences, which match only
       ASCII characters	by default, these  always  match  certain  high-valued
       code points, regardless if ucp is set.

       The following are the horizontal	space characters:

	 U+0009:
	   Horizontal tab (HT)

	 U+0020:
	   Space

	 U+00A0:
	   Non-break space

	 U+1680:
	   Ogham space mark

	 U+180E:
	   Mongolian vowel separator

	 U+2000:
	   En quad

	 U+2001:
	   Em quad

	 U+2002:
	   En space

	 U+2003:
	   Em space

	 U+2004:
	   Three-per-em	space

	 U+2005:
	   Four-per-em space

	 U+2006:
	   Six-per-em space

	 U+2007:
	   Figure space

	 U+2008:
	   Punctuation space

	 U+2009:
	   Thin	space

	 U+200A:
	   Hair	space

	 U+202F:
	   Narrow no-break space

	 U+205F:
	   Medium mathematical space

	 U+3000:
	   Ideographic space

       The following are the vertical space characters:

	 U+000A:
	   Line	feed (LF)

	 U+000B:
	   Vertical tab	(VT)

	 U+000C:
	   Form	feed (FF)

	 U+000D:
	   Carriage return (CR)

	 U+0085:
	   Next	line (NEL)

	 U+2028:
	   Line	separator

	 U+2029:
	   Paragraph separator

       In  8-bit,  non-UTF-8  mode, only the characters	with code points < 256
       are relevant.

       Newline Sequences

       Outside a character class, by default, the escape sequence  \R  matches
       any  Unicode  newline  sequence.	In non-UTF-8 mode, \R is equivalent to
       the following:

       (?>\r\n|\n|\x0b|\f|\r|\x85)

       This is an example of an	"atomic	group",	details	are provided below.

       This particular group matches either the	two-character sequence CR fol-
       lowed by	LF, or one of the single characters LF (line feed, U+000A), VT
       (vertical tab, U+000B), FF (form	feed, U+000C),	CR  (carriage  return,
       U+000D),	 or  NEL  (next	 line,	U+0085). The two-character sequence is
       treated as a single unit	that cannot be split.

       In Unicode mode,	two more characters whose code points are  >  255  are
       added:  LS  (line  separator,  U+2028)  and  PS	(paragraph  separator,
       U+2029).	Unicode	character property support is  not  needed  for	 these
       characters to be	recognized.

       \R can be restricted to match only CR, LF, or CRLF (instead of the com-
       plete set of Unicode line endings) by setting option bsr_anycrlf	either
       at  compile time	or when	the pattern is matched.	(BSR is	an acronym for
       "backslash R".) This can	be made	the default when PCRE is built;	if so,
       the other behavior can be requested through option  bsr_unicode.	 These
       settings	can also be specified by starting a pattern string with	one of
       the following sequences:

	 (*BSR_ANYCRLF):
	   CR, LF, or CRLF only

	 (*BSR_UNICODE):
	   Any Unicode newline sequence

       These  override	the default and	the options specified to the compiling
       function, but they can themselves be overridden by options specified to
       a matching function. Notice that	these special settings,	which are  not
       Perl-compatible,	 are  recognized  only at the very start of a pattern,
       and that	they must be in	upper case.  If	 more  than  one  of  them  is
       present,	 the  last  one	is used. They can be combined with a change of
       newline convention; for example,	a pattern can start with:

       (*ANY)(*BSR_ANYCRLF)

       They can	also be	combined with the (*UTF8), (*UTF), or  (*UCP)  special
       sequences.  Inside  a character class, \R is treated as an unrecognized
       escape sequence,	and so matches the letter "R" by default.

       Unicode Character Properties

       Three more escape sequences that	match characters with specific proper-
       ties are	available. When	in 8-bit non-UTF-8 mode, these	sequences  are
       limited	to testing characters whose code points	are < 256, but they do
       work in this mode. The following	are the	extra escape sequences:

	 \p{xx}:
	   A character with property xx

	 \P{xx}:
	   A character without property	xx

	 \X:
	   A Unicode extended grapheme cluster

       The property names represented by xx above are limited to  the  Unicode
       script names, the general category properties, "Any", which matches any
       character  (including  newline),	 and some special PCRE properties (de-
       scribed in the next section). Other Perl	properties, such  as  "InMusi-
       calSymbols",  are  currently not	supported by PCRE. Notice that \P{Any}
       does not	match any characters and always	causes a match failure.

       Sets of Unicode characters are defined as belonging to certain scripts.
       A character from	one of these sets can be matched using a script	 name,
       for example:

       \p{Greek} \P{Han}

       Those  that are not part	of an identified script	are lumped together as
       "Common". The following is the current list of scripts:

	 * Arabic

	 * Armenian

	 * Avestan

	 * Balinese

	 * Bamum

	 * Bassa_Vah

	 * Batak

	 * Bengali

	 * Bopomofo

	 * Braille

	 * Buginese

	 * Buhid

	 * Canadian_Aboriginal

	 * Carian

	 * Caucasian_Albanian

	 * Chakma

	 * Cham

	 * Cherokee

	 * Common

	 * Coptic

	 * Cuneiform

	 * Cypriot

	 * Cyrillic

	 * Deseret

	 * Devanagari

	 * Duployan

	 * Egyptian_Hieroglyphs

	 * Elbasan

	 * Ethiopic

	 * Georgian

	 * Glagolitic

	 * Gothic

	 * Grantha

	 * Greek

	 * Gujarati

	 * Gurmukhi

	 * Han

	 * Hangul

	 * Hanunoo

	 * Hebrew

	 * Hiragana

	 * Imperial_Aramaic

	 * Inherited

	 * Inscriptional_Pahlavi

	 * Inscriptional_Parthian

	 * Javanese

	 * Kaithi

	 * Kannada

	 * Katakana

	 * Kayah_Li

	 * Kharoshthi

	 * Khmer

	 * Khojki

	 * Khudawadi

	 * Lao

	 * Latin

	 * Lepcha

	 * Limbu

	 * Linear_A

	 * Linear_B

	 * Lisu

	 * Lycian

	 * Lydian

	 * Mahajani

	 * Malayalam

	 * Mandaic

	 * Manichaean

	 * Meetei_Mayek

	 * Mende_Kikakui

	 * Meroitic_Cursive

	 * Meroitic_Hieroglyphs

	 * Miao

	 * Modi

	 * Mongolian

	 * Mro

	 * Myanmar

	 * Nabataean

	 * New_Tai_Lue

	 * Nko

	 * Ogham

	 * Ol_Chiki

	 * Old_Italic

	 * Old_North_Arabian

	 * Old_Permic

	 * Old_Persian

	 * Oriya

	 * Old_South_Arabian

	 * Old_Turkic

	 * Osmanya

	 * Pahawh_Hmong

	 * Palmyrene

	 * Pau_Cin_Hau

	 * Phags_Pa

	 * Phoenician

	 * Psalter_Pahlavi

	 * Rejang

	 * Runic

	 * Samaritan

	 * Saurashtra

	 * Sharada

	 * Shavian

	 * Siddham

	 * Sinhala

	 * Sora_Sompeng

	 * Sundanese

	 * Syloti_Nagri

	 * Syriac

	 * Tagalog

	 * Tagbanwa

	 * Tai_Le

	 * Tai_Tham

	 * Tai_Viet

	 * Takri

	 * Tamil

	 * Telugu

	 * Thaana

	 * Thai

	 * Tibetan

	 * Tifinagh

	 * Tirhuta

	 * Ugaritic

	 * Vai

	 * Warang_Citi

	 * Yi

       Each character has exactly one Unicode general category property, spec-
       ified by	a two-letter acronym. For compatibility	 with  Perl,  negation
       can  be	specified  by including	a circumflex between the opening brace
       and the property	name. For example, \p{^Lu} is the same as \P{Lu}.

       If only one letter is specified with \p or \P, it includes all the gen-
       eral category properties	that start with	that letter. In	this case,  in
       the  absence of negation, the curly brackets in the escape sequence are
       optional. The following two examples have the same effect:

       \p{L}
       \pL

       The following general category property codes are supported:

	 C:
	   Other

	 Cc:
	   Control

	 Cf:
	   Format

	 Cn:
	   Unassigned

	 Co:
	   Private use

	 Cs:
	   Surrogate

	 L:
	   Letter

	 Ll:
	   Lowercase letter

	 Lm:
	   Modifier letter

	 Lo:
	   Other letter

	 Lt:
	   Title case letter

	 Lu:
	   Uppercase letter

	 M:
	   Mark

	 Mc:
	   Spacing mark

	 Me:
	   Enclosing mark

	 Mn:
	   Non-spacing mark

	 N:
	   Number

	 Nd:
	   Decimal number

	 Nl:
	   Letter number

	 No:
	   Other number

	 P:
	   Punctuation

	 Pc:
	   Connector punctuation

	 Pd:
	   Dash	punctuation

	 Pe:
	   Close punctuation

	 Pf:
	   Final punctuation

	 Pi:
	   Initial punctuation

	 Po:
	   Other punctuation

	 Ps:
	   Open	punctuation

	 S:
	   Symbol

	 Sc:
	   Currency symbol

	 Sk:
	   Modifier symbol

	 Sm:
	   Mathematical	symbol

	 So:
	   Other symbol

	 Z:
	   Separator

	 Zl:
	   Line	separator

	 Zp:
	   Paragraph separator

	 Zs:
	   Space separator

       The special property L& is also supported. It matches a character  that
       has  the	 Lu, Ll, or Lt property, that is, a letter that	is not classi-
       fied as a modifier or "other".

       The Cs (Surrogate) property applies only	to  characters	in  the	 range
       U+D800 to U+DFFF. Such characters are invalid in	Unicode	strings	and so
       cannot be tested	by PCRE. Perl does not support the Cs property.

       The long	synonyms for property names supported by Perl (such as \p{Let-
       ter})  are  not supported by PCRE. It is	not permitted to prefix	any of
       these properties	with "Is".

       No character in the Unicode table has  the  Cn  (unassigned)  property.
       This  property is instead assumed for any code point that is not	in the
       Unicode table.

       Specifying caseless matching does not affect  these  escape  sequences.
       For example, \p{Lu} always matches only uppercase letters. This is dif-
       ferent from the behavior	of current versions of Perl.

       Matching	 characters by Unicode property	is not fast, as	PCRE must do a
       multistage table	lookup to find a character property. That is  why  the
       traditional escape sequences such as \d and \w do not use Unicode prop-
       erties  in PCRE by default. However, you	can make them do so by setting
       option ucp or by	starting the pattern with (*UCP).

       Extended	Grapheme Clusters

       The \X escape matches any number	of Unicode  characters	that  form  an
       "extended grapheme cluster", and	treats the sequence as an atomic group
       (see below). Up to and including	release	8.31, PCRE matched an earlier,
       simpler	definition  that  was  equivalent  to (?>\PM\pM*). That	is, it
       matched a character without the "mark" property,	followed  by  zero  or
       more  characters	 with  the "mark" property. Characters with the	"mark"
       property	are typically non-spacing accents that	affect	the  preceding
       character.

       This  simple definition was extended in Unicode to include more compli-
       cated kinds of composite	character by giving each character a  grapheme
       breaking	 property, and creating	rules that use these properties	to de-
       fine the	boundaries of extended grapheme	 clusters.  In	PCRE  releases
       later than 8.31,	\X matches one of these	clusters.

       \X  always  matches  at least one character. Then it decides whether to
       add more	characters according to	the following rules for	ending a clus-
       ter:

	 * End at the end of the subject string.

	 * Do not end between CR and LF; otherwise end after any control char-
	   acter.

	 * Do not break	Hangul (a Korean script)  syllable  sequences.	Hangul
	   characters  are of five types: L, V,	T, LV, and LVT.	An L character
	   can be followed by an L, V, LV, or LVT character. An	LV or V	 char-
	   acter  can be followed by a V or T character. An LVT	or T character
	   can be followed only	by a T character.

	 * Do not end before extending characters or spacing marks. Characters
	   with	the "mark" property always have	the "extend" grapheme breaking
	   property.

	 * Do not end after prepend characters.

	 * Otherwise, end the cluster.

       PCRE Additional Properties

       In addition to the standard Unicode properties described	earlier,  PCRE
       supports	 four more that	make it	possible to convert traditional	escape
       sequences, such as \w and \s to use Unicode properties. PCRE uses these
       non-standard, non-Perl properties internally when  the  ucp  option  is
       passed.	However,  they can also	be used	explicitly. The	properties are
       as follows:

	 Xan:
	   Any alphanumeric character. Matches characters that have either the
	   L (letter) or the N (number)	property.

	 Xps:
	   Any Posix space character. Matches the characters tab,  line	 feed,
	   vertical  tab,  form	feed, carriage return, and any other character
	   that	has the	Z (separator) property.

	 Xsp:
	   Any Perl space character. Matches the same as Xps, except that ver-
	   tical tab is	excluded.

	 Xwd:
	   Any Perl "word" character. Matches the same characters as Xan, plus
	   underscore.

       Perl and	POSIX space are	now the	same. Perl added VT to its space char-
       acter set at release 5.18 and PCRE changed at release 8.34.

       Xan matches characters that have	either the L (letter) or the  N	 (num-
       ber)  property. Xps matches the characters tab, linefeed, vertical tab,
       form feed, or carriage return, and any other character that has	the  Z
       (separator) property. Xsp is the	same as	Xps; it	used to	exclude	verti-
       cal tab,	for Perl compatibility,	but Perl changed, and so PCRE followed
       at  release  8.34.  Xwd matches the same	characters as Xan, plus	under-
       score.

       There is	another	non-standard property, Xuc, which matches any  charac-
       ter  that  can  be represented by a Universal Character Name in C++ and
       other programming languages. These are the characters $,	 @,  `	(grave
       accent),	 and all characters with Unicode code points >=	U+00A0,	except
       for the surrogates U+D800 to U+DFFF.  Notice  that  most	 base  (ASCII)
       characters  are	excluded.  (Universal  Character Names are of the form
       \uHHHH or \UHHHHHHHH, where H is	a hexadecimal digit. Notice  that  the
       Xuc  property  does  not	 match these sequences but the characters that
       they represent.)

       Resetting the Match Start

       The escape sequence \K causes any previously matched characters not  to
       be  included  in	the final matched sequence. For	example, the following
       pattern matches "foobar", but reports that it has matched "bar":

       foo\Kbar

       This feature is similar to a lookbehind	assertion  (described  below).
       However,	 in  this  case, the part of the subject before	the real match
       does not	have to	be of fixed length, as lookbehind assertions  do.  The
       use  of	\K does	not interfere with the setting of captured substrings.
       For example, when the following pattern	matches	 "foobar",  the	 first
       substring is still set to "foo":

       (foo)\Kbar

       Perl  documents	that  the use of \K within assertions is "not well de-
       fined". In PCRE,	\K is acted upon when it occurs	inside positive	asser-
       tions, but is ignored in	negative assertions. Note that when a  pattern
       such  as	 (?=ab\K)  matches,  the  reported  start  of the match	can be
       greater than the	end of the match.

       Simple Assertions

       The final use of	backslash is for certain simple	assertions. An	asser-
       tion  specifies a condition that	must be	met at a particular point in a
       match, without consuming	any characters from the	 subject  string.  The
       use  of subpatterns for more complicated	assertions is described	below.
       The following are the backslashed assertions:

	 \b:
	   Matches at a	word boundary.

	 \B:
	   Matches when	not at a word boundary.

	 \A:
	   Matches at the start	of the subject.

	 \Z:
	   Matches at the end of the subject, and before a newline at the  end
	   of the subject.

	 \z:
	   Matches only	at the end of the subject.

	 \G:
	   Matches at the first	matching position in the subject.

       Inside  a  character  class, \b has a different meaning;	it matches the
       backspace character. If any other of  these  assertions	appears	 in  a
       character  class, by default it matches the corresponding literal char-
       acter (for example, \B matches the letter B).

       A word boundary is a position in	the subject string where  the  current
       character  and  the previous character do not both match	\w or \W (that
       is, one matches \w and the other	matches	\W), or	the start  or  end  of
       the  string if the first	or last	character matches \w, respectively. In
       UTF mode, the meanings of \w and	\W can be changed  by  setting	option
       ucp. When this is done, it also affects \b and \B. PCRE and Perl	do not
       have a separate "start of word" or "end of word"	metasequence. However,
       whatever	 follows  \b normally determines which it is. For example, the
       fragment	\ba matches "a"	at the start of	a word.

       The \A, \Z, and \z assertions differ from  the  traditional  circumflex
       and dollar (described in	the next section) in that they only ever match
       at  the	very start and end of the subject string, whatever options are
       set. Thus, they are independent of multiline mode. These	 three	asser-
       tions  are  not affected	by options notbol or noteol, which affect only
       the behavior of the circumflex and dollar metacharacters.  However,  if
       argument	 startoffset of	run/3 is non-zero, indicating that matching is
       to start	at a point other than the beginning of	the  subject,  \A  can
       never match. The	difference between \Z and \z is	that \Z	matches	before
       a  newline  at  the  end	 of  the  string and at	the very end, while \z
       matches only at the end.

       The \G assertion	is true	only when the current matching position	is  at
       the  start  point of the	match, as specified by argument	startoffset of
       run/3. It differs from \A when the value	of startoffset is non-zero. By
       calling run/3 multiple times with appropriate arguments,	you can	 mimic
       the  Perl  option /g, and it is in this kind of implementation where \G
       can be useful.

       Notice, however,	that the PCRE interpretation of	\G, as	the  start  of
       the  current  match, is subtly different	from Perl, which defines it as
       the end of the previous match. In Perl, these can be different when the
       previously matched string was empty. As PCRE does only one match	 at  a
       time, it	cannot reproduce this behavior.

       If  all	the alternatives of a pattern begin with \G, the expression is
       anchored	to the starting	match position,	and the	"anchored" flag	is set
       in the compiled regular expression.

CIRCUMFLEX AND DOLLAR
       The circumflex and dollar  metacharacters  are  zero-width  assertions.
       That  is,  they test for	a particular condition to be true without con-
       suming any characters from the subject string.

       Outside a character class, in the default matching mode,	the circumflex
       character is an assertion that is true only  if	the  current  matching
       point is	at the start of	the subject string. If argument	startoffset of
       run/3  is  non-zero,  circumflex	can never match	if option multiline is
       unset. Inside a character class,	circumflex has an  entirely  different
       meaning (see below).

       Circumflex  needs  not to be the	first character	of the pattern if some
       alternatives are	involved, but it is to be the first thing in each  al-
       ternative  in  which  it	 appears  if the pattern is ever to match that
       branch. If all possible alternatives start with a circumflex, that  is,
       if  the	pattern	 is constrained	to match only at the start of the sub-
       ject, it	is said	to be an "anchored" pattern.  (There  are  also	 other
       constructs that can cause a pattern to be anchored.)

       The  dollar  character is an assertion that is true only	if the current
       matching	point is at the	end of the subject string, or immediately  be-
       fore  a	newline	 at the	end of the string (by default).	Notice however
       that it does not	match the newline. Dollar needs	not  to	 be  the  last
       character  of  the pattern if some alternatives are involved, but it is
       to be the last item in any branch in which it appears.  Dollar  has  no
       special meaning in a character class.

       The  meaning  of	 dollar	 can be	changed	so that	it matches only	at the
       very end	of the string, by setting  option  dollar_endonly  at  compile
       time. This does not affect the \Z assertion.

       The meanings of the circumflex and dollar characters are	changed	if op-
       tion  multiline is set. When this is the	case, a	circumflex matches im-
       mediately after internal	newlines and  at  the  start  of  the  subject
       string.	It does	not match after	a newline that ends the	string.	A dol-
       lar matches before any newlines in the string, and  at  the  very  end,
       when  multiline	is set.	When newline is	specified as the two-character
       sequence	CRLF, isolated CR and LF characters do not indicate newlines.

       For example, the	pattern	/^abc$/	matches	the subject string  "def\nabc"
       (where  \n  represents a	newline) in multiline mode, but	not otherwise.
       So, patterns that are anchored in single-line mode because all branches
       start with ^ are	not anchored in	multiline mode,	and a match  for  cir-
       cumflex is possible when	argument startoffset of	run/3 is non-zero. Op-
       tion dollar_endonly is ignored if multiline is set.

       Notice that the sequences \A, \Z, and \z	can be used to match the start
       and  end	 of  the  subject  in both modes. If all branches of a pattern
       start with \A, it is always anchored, regardless	if multiline is	set.

FULL STOP (PERIOD, DOT)	AND .
       Outside a character class, a dot	in the pattern matches	any  character
       in  the	subject	 string	except (by default) a character	that signifies
       the end of a line.

       When a line ending is defined as	a single character, dot	never  matches
       that  character.	When the two-character sequence	CRLF is	used, dot does
       not match CR if it is immediately followed by LF, otherwise it  matches
       all  characters (including isolated CRs and LFs). When any Unicode line
       endings are recognized, dot does	not match CR, LF, or any of the	 other
       line-ending characters.

       The behavior of dot regarding newlines can be changed. If option	dotall
       is  set,	 a  dot	 matches any character,	without	exception. If the two-
       character sequence CRLF is present in the subject string, it takes  two
       dots to match it.

       The  handling of	dot is entirely	independent of the handling of circum-
       flex and	dollar,	the only relationship is that both  involve  newlines.
       Dot has no special meaning in a character class.

       The  escape  sequence  \N behaves like a	dot, except that it is not af-
       fected by option	PCRE_DOTALL. That is, it matches any character	except
       one  that signifies the end of a	line. Perl also	uses \N	to match char-
       acters by name but PCRE does not	support	this.

MATCHING A SINGLE DATA UNIT
       Outside a character class, the escape  sequence	\C  matches  any  data
       unit,  regardless  if a UTF mode	is set.	One data unit is one byte. Un-
       like a dot, \C always matches line-ending characters.  The  feature  is
       provided	in Perl	to match individual bytes in UTF-8 mode, but it	is un-
       clear  how it can usefully be used. As \C breaks	up characters into in-
       dividual	data units, matching one unit with \C in a UTF mode means that
       the remaining string can	start with a malformed UTF character. This has
       undefined results, as  PCRE  assumes  that  it  deals  with  valid  UTF
       strings.

       PCRE  does  not	allow \C to appear in lookbehind assertions (described
       below) in a UTF mode, as	this would make	it impossible to calculate the
       length of the lookbehind.

       The \C escape sequence is best avoided. However,	one way	 of  using  it
       that  avoids the	problem	of malformed UTF characters is to use a	looka-
       head to check the length	of the next character,	as  in	the  following
       pattern,	 which	can be used with a UTF-8 string	(ignore	whitespace and
       line breaks):

       (?| (?=[\x00-\x7f])(\C) |
	   (?=[\x80-\x{7ff}])(\C)(\C) |
	   (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
	   (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))

       A group that starts with	(?| resets the capturing  parentheses  numbers
       in each alternative (see	section	Duplicate Subpattern Numbers). The as-
       sertions	at the start of	each branch check the next UTF-8 character for
       values whose encoding uses 1, 2,	3, or 4	bytes, respectively. The indi-
       vidual bytes of the character are then captured by the appropriate num-
       ber of groups.

SQUARE BRACKETS	AND CHARACTER CLASSES
       An opening square bracket introduces a character	class, terminated by a
       closing square bracket. A closing square	bracket	on its own is not spe-
       cial  by	 default.  However, if option PCRE_JAVASCRIPT_COMPAT is	set, a
       lone closing square bracket causes a compile-time error.	If  a  closing
       square  bracket	is  required as	a member of the	class, it is to	be the
       first data character in the class  (after  an  initial  circumflex,  if
       present)	or escaped with	a backslash.

       A  character  class matches a single character in the subject. In a UTF
       mode, the character can be more than one	 data  unit  long.  A  matched
       character must be in the	set of characters defined by the class,	unless
       the  first  character in	the class definition is	a circumflex, in which
       case the	subject	character must not be in the set defined by the	class.
       If a circumflex is required as a	member of the class, ensure that it is
       not the first character,	or escape it with a backslash.

       For example, the	character class	[aeiou]	matches	any  lowercase	vowel,
       while [^aeiou] matches any character that is not	a lowercase vowel. No-
       tice that a circumflex is just a	convenient notation for	specifying the
       characters  that	 are in	the class by enumerating those that are	not. A
       class that starts with a	circumflex is not an assertion;	it still  con-
       sumes  a	 character  from the subject string, and therefore it fails if
       the current pointer is at the end of the	string.

       In UTF-8	mode, characters with values > 255 (0xffff) can	be included in
       a class as a literal string of data units, or by	using the \x{ escaping
       mechanism.

       When caseless matching is set, any letters in a	class  represent  both
       their uppercase and lowercase versions. For example, a caseless [aeiou]
       matches	"A" and	"a", and a caseless [^aeiou] does not match "A", but a
       caseful version would. In a UTF mode, PCRE always understands the  con-
       cept  of	case for characters whose values are < 256, so caseless	match-
       ing is always possible. For characters with higher values, the  concept
       of  case	 is  supported	only if	PCRE is	compiled with Unicode property
       support.	If you want to use caseless matching in	a UTF mode for charac-
       ters >=,	ensure that PCRE is compiled with Unicode property support and
       with UTF	support.

       Characters that can indicate line breaks	are never treated in any  spe-
       cial way	when matching character	classes, whatever line-ending sequence
       is  in use, and whatever	setting	of options PCRE_DOTALL and PCRE_MULTI-
       LINE is used. A class such as [^a] always matches one of	these  charac-
       ters.

       The  minus (hyphen) character can be used to specify a range of charac-
       ters in a character class. For example, [d-m] matches  any  letter  be-
       tween  d	and m, inclusive. If a minus character is required in a	class,
       it must be escaped with a backslash or appear in	a  position  where  it
       cannot  be interpreted as indicating a range, typically as the first or
       last character in the class, or immediately after a range. For example,
       [b-d-z] matches letters in the range b to d, a hyphen character,	or z.

       The literal character "]" cannot	be the end character  of  a  range.  A
       pattern	such  as  [W-]46]  is interpreted as a class of	two characters
       ("W" and	"-") followed by a literal string "46]",  so  it  would	 match
       "W46]"  or  "-46]".  However, if	"]" is escaped with a backslash, it is
       interpreted as the end of range,	so [W-\]46] is interpreted as a	 class
       containing a range followed by two other	characters. The	octal or hexa-
       decimal representation of "]" can also be used to end a range.

       An  error is generated if a POSIX character class (see below) or	an es-
       cape sequence other than	one that defines a single character appears at
       a point where a	range  ending  character  is  expected.	 For  example,
       [z-\xff]	is valid, but [A-\d] and [A-[:digit:]] are not.

       Ranges  operate in the collating	sequence of character values. They can
       also  be	 used  for  characters	specified  numerically,	 for  example,
       [\000-\037].  Ranges  can include any characters	that are valid for the
       current mode.

       If a range that includes	letters	is used	when caseless matching is set,
       it matches the letters in either	case. For example, [W-c] is equivalent
       to [][\\^_`wxyzabc], matched caselessly.	In a non-UTF mode, if  charac-
       ter tables for a	French locale are in use, [\xc8-\xcb] matches accented
       E  characters in	both cases. In UTF modes, PCRE supports	the concept of
       case for	characters with	values > 255 only when	it  is	compiled  with
       Unicode property	support.

       The  character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
       \w, and \W can appear in	a character class, and add the characters that
       they match to the class.	For example, [\dABCDEF]	matches	any  hexadeci-
       mal  digit. In UTF modes, option	ucp affects the	meanings of \d,	\s, \w
       and their uppercase partners, just as it	does when they appear  outside
       a character class, as described in section Generic Character Types ear-
       lier. The escape	sequence \b has	a different meaning inside a character
       class;  it  matches  the	backspace character. The sequences \B, \N, \R,
       and \X are not special inside a character class.	Like any other	unrec-
       ognized	escape	sequences,  they are treated as	the literal characters
       "B", "N", "R", and "X".

       A circumflex can	conveniently be	 used  with  the  uppercase  character
       types  to specify a more	restricted set of characters than the matching
       lowercase type. For example, class [^\W_] matches any letter or	digit,
       but  not	underscore, while [\w] includes	underscore. A positive charac-
       ter class is to be read as "something OR	something OR ..." and a	 nega-
       tive class as "NOT something AND	NOT something AND NOT ...".

       Only the	following metacharacters are recognized	in character classes:

	 * Backslash

	 * Hyphen (only	where it can be	interpreted as specifying a range)

	 * Circumflex (only at the start)

	 * Opening  square  bracket (only when it can be interpreted as	intro-
	   ducing a Posix class	name, or for a special compatibility  feature;
	   see the next	two sections)

	 * Terminating closing square bracket

       However,	escaping other non-alphanumeric	characters does	no harm.

POSIX CHARACTER	CLASSES
       Perl supports the Posix notation	for character classes. This uses names
       enclosed	 by  [:	and :] within the enclosing square brackets. PCRE also
       supports	this notation. For example, the	following  matches  "0",  "1",
       any alphabetic character, or "%":

       [01[:alpha:]%]

       The following are the supported class names:

	 alnum:
	   Letters and digits

	 alpha:
	   Letters

	 blank:
	   Space or tab	only

	 cntrl:
	   Control characters

	 digit:
	   Decimal digits (same	as \d)

	 graph:
	   Printing characters,	excluding space

	 lower:
	   Lowercase letters

	 print:
	   Printing characters,	including space

	 punct:
	   Printing characters,	excluding letters, digits, and space

	 space:
	   Whitespace (the same	as \s from PCRE	8.34)

	 upper:
	   Uppercase letters

	 word:
	   "Word" characters (same as \w)

	 xdigit:
	   Hexadecimal digits

       There  is  another  character  class,  ascii,  that erroneously matches
       Latin-1 characters instead of the 0-127 range specified by POSIX.  This
       cannot  be fixed	without	altering the behaviour of other	classes, so we
       recommend matching the range with [\\0-\x7f] instead.

       The default "space" characters are HT (9), LF (10), VT (11),  FF	 (12),
       CR  (13),  and space (32). If locale-specific matching is taking	place,
       the list	of space characters may	be different; there may	 be  fewer  or
       more of them. "Space" used to be	different to \s, which did not include
       VT,  for	Perl compatibility. However, Perl changed at release 5.18, and
       PCRE followed at	release	8.34. "Space" and \s now match the same	set of
       characters.

       The name	"word" is a Perl extension, and	"blank"	 is  a	GNU  extension
       from  Perl  5.8.	Another	Perl extension is negation, which is indicated
       by a ^ character	after the colon. For example,  the  following  matches
       "1", "2", or any	non-digit:

       [12[:^digit:]]

       PCRE (and Perl) also recognize the Posix	syntax [.ch.] and [=ch=] where
       "ch"  is	a "collating element", but these are not supported, and	an er-
       ror is given if they are	encountered.

       By default, characters with values > 255	do not match any of the	 Posix
       character  classes.  However, if	option PCRE_UCP	is passed to pcre_com-
       pile(), some of the classes are changed so that Unicode character prop-
       erties are used.	This is	achieved by replacing certain Posix classes by
       other sequences,	as follows:

	 [:alnum:]:
	   Becomes \p{Xan}

	 [:alpha:]:
	   Becomes \p{L}

	 [:blank:]:
	   Becomes \h

	 [:digit:]:
	   Becomes \p{Nd}

	 [:lower:]:
	   Becomes \p{Ll}

	 [:space:]:
	   Becomes \p{Xps}

	 [:upper:]:
	   Becomes \p{Lu}

	 [:word:]:
	   Becomes \p{Xwd}

       Negated versions, such as [:^alpha:], use \P instead of \p. Three other
       POSIX classes are handled specially in UCP mode:

	 [:graph:]:
	   This	matches	characters that	have glyphs that mark  the  page  when
	   printed.  In	Unicode	property terms,	it matches all characters with
	   the L, M, N,	P, S, or Cf properties,	except for:

	   U+061C:
	     Arabic Letter Mark

	   U+180E:
	     Mongolian Vowel Separator

	   U+2066 - U+2069:
	     Various "isolate"s

	 [:print:]:
	   This	matches	the same characters as [:graph:] plus space characters
	   that	are not	controls, that is, characters with the Zs property.

	 [:punct:]:
	   This	matches	all characters that have the Unicode  P	 (punctuation)
	   property, plus those	characters whose code points are less than 128
	   that	have the S (Symbol) property.

       The  other  POSIX classes are unchanged,	and match only characters with
       code points less	than 128.

       Compatibility Feature for Word Boundaries

       In the POSIX.2 compliant	library	that was included in 4.4BSD Unix,  the
       ugly  syntax  [[:<:]]  and [[:>:]] is used for matching "start of word"
       and "end	of word". PCRE treats these items as follows:

	 [[:<:]]:
	   is converted	to \b(?=\w)

	 [[:>:]]:
	   is converted	to \b(?<=\w)

       Only these exact	character sequences are	recognized. A sequence such as
       [a[:<:]b] provokes error	for an unrecognized  POSIX  class  name.  This
       support	is not compatible with Perl. It	is provided to help migrations
       from other environments,	and is best not	used in	any new	patterns. Note
       that \b matches at the start and	the end	of a word (see "Simple	asser-
       tions"  above),	and in a Perl-style pattern the	preceding or following
       character normally shows	which is wanted, without the need for the  as-
       sertions	 that are used above in	order to give exactly the POSIX	behav-
       iour.

VERTICAL BAR
       Vertical	bar characters are used	to separate alternative	patterns.  For
       example,	the following pattern matches either "gilbert" or "sullivan":

       gilbert|sullivan

       Any number of alternatives can appear, and an empty alternative is per-
       mitted (matching	the empty string). The matching	process	tries each al-
       ternative  in  turn, from left to right,	and the	first that succeeds is
       used. If	the alternatives are within a subpattern (defined  in  section
       Subpatterns),  "succeeds" means matching	the remaining main pattern and
       the alternative in the subpattern.

INTERNAL OPTION	SETTING
       The  settings  of  the  Perl-compatible	options	 caseless,  multiline,
       dotall,	and  extended  can be changed from within the pattern by a se-
       quence of Perl option letters enclosed between "(?" and ")". The	option
       letters are as follows:

	 i:
	   For caseless

	 m:
	   For multiline

	 s:
	   For dotall

	 x:
	   For extended

       For example, (?im) sets caseless, multiline matching. These options can
       also be unset by	preceding the letter with a hyphen. A combined setting
       and unsetting such as (?im-sx),	which  sets  caseless  and  multiline,
       while unsetting dotall and extended, is also permitted. If a letter ap-
       pears both before and after the hyphen, the option is unset.

       The  PCRE-specific options dupnames, ungreedy, and extra	can be changed
       in the same way as the Perl-compatible options by using the  characters
       J, U, and X respectively.

       When  one of these option changes occurs	at top-level (that is, not in-
       side subpattern parentheses), the change	applies	to  the	 remainder  of
       the pattern that	follows.

       An  option change within	a subpattern (see section Subpatterns) affects
       only that part of the subpattern	that follows  it.  So,	the  following
       matches	abc  and  aBc  and  no other strings (assuming caseless	is not
       used):

       (a(?i)b)c

       By this means, options can be made to have different settings  in  dif-
       ferent  parts  of  the  pattern.	Any changes made in one	alternative do
       carry on	into subsequent	branches within	the same subpattern. For exam-
       ple:

       (a(?i)b|c)

       matches "ab", "aB", "c",	and "C", although when matching	"C" the	 first
       branch  is abandoned before the option setting. This is because the ef-
       fects of	option settings	occur at compile time.	There  would  be  some
       weird behavior otherwise.

   Note:
       Other PCRE-specific options can be set by the application when the com-
       piling or matching functions are	called.	Sometimes the pattern can con-
       tain  special  leading sequences, such as (*CRLF), to override what the
       application has set or what has been defaulted. Details are provided in
       section	Newline	Sequences earlier.

       The (*UTF8) and (*UCP) leading sequences	can be used  to	 set  UTF  and
       Unicode	property modes.	They are equivalent to setting options unicode
       and ucp,	respectively. The (*UTF) sequence is a	generic	 version  that
       can be used with	any of the libraries. However, the application can set
       option never_utf, which locks out the use of the	(*UTF) sequences.

SUBPATTERNS
       Subpatterns are delimited by parentheses	(round brackets), which	can be
       nested. Turning part of a pattern into a	subpattern does	two things:

	 1.:
	   It localizes	a set of alternatives. For example, the	following pat-
	   tern	matches	"cataract", "caterpillar", or "cat":

	 cat(aract|erpillar|)

	   Without  the	parentheses, it	would match "cataract",	"erpillar", or
	   an empty string.

	 2.:
	   It sets up the subpattern as	a capturing subpattern.	That is,  when
	   the	complete  pattern  matches, that portion of the	subject	string
	   that	matched	the subpattern is passed back to  the  caller  through
	   the return value of run/3.

       Opening parentheses are counted from left to right (starting from 1) to
       obtain  numbers	for  the  capturing  subpatterns.  For example,	if the
       string "the red king" is	matched	against	 the  following	 pattern,  the
       captured	substrings are "red king", "red", and "king", and are numbered
       1, 2, and 3, respectively:

       the ((red|white)	(king|queen))

       It  is not always helpful that plain parentheses	fulfill	two functions.
       Often a grouping	subpattern is required without	a  capturing  require-
       ment.  If  an  opening parenthesis is followed by a question mark and a
       colon, the subpattern does not do any capturing,	 and  is  not  counted
       when  computing the number of any subsequent capturing subpatterns. For
       example,	if the string "the white queen"	is matched against the follow-
       ing pattern, the	captured substrings are	"white queen" and "queen", and
       are numbered 1 and 2:

       the ((?:red|white) (king|queen))

       The maximum number of capturing subpatterns is 65535.

       As a convenient shorthand, if any option	settings are required  at  the
       start  of a non-capturing subpattern, the option	letters	can appear be-
       tween "?" and ":". Thus,	the following two patterns match the same  set
       of strings:

       (?i:saturday|sunday)
       (?:(?i)saturday|sunday)

       As  alternative	branches are tried from	left to	right, and options are
       not reset until the end of the subpattern is reached, an	option setting
       in one branch does affect subsequent branches, so  the  above  patterns
       match both "SUNDAY" and "Saturday".

DUPLICATE SUBPATTERN NUMBERS
       Perl  5.10  introduced a	feature	where each alternative in a subpattern
       uses the	same numbers for its capturing parentheses. Such a  subpattern
       starts  with (?|	and is itself a	non-capturing subpattern. For example,
       consider	the following pattern:

       (?|(Sat)ur|(Sun))day

       As the two alternatives are inside a (?|	group, both sets of  capturing
       parentheses  are	 numbered one. Thus, when the pattern matches, you can
       look at captured	substring number one, whichever	 alternative  matched.
       This  construct is useful when you want to capture a part, but not all,
       of one of many alternatives. Inside a (?| group,	parentheses  are  num-
       bered  as  usual,  but the number is reset at the start of each branch.
       The numbers of any capturing parentheses	 that  follow  the  subpattern
       start  after the	highest	number used in any branch. The following exam-
       ple is from the Perl documentation;  the	 numbers  underneath  show  in
       which buffer the	captured content is stored:

       # before	 ---------------branch-reset-----------	after
       / ( a )	(?| x (	y ) z |	(p (q) r) | (t)	u (v) )	( z ) /x
       # 1	      2		2  3	    2	  3	4

       A  back	reference  to a	numbered subpattern uses the most recent value
       that is set for that number by any subpattern.  The  following  pattern
       matches "abcabc"	or "defdef":

       /(?|(abc)|(def))\1/

       In  contrast,  a	subroutine call	to a numbered subpattern always	refers
       to the first one	in the pattern with the	given  number.	The  following
       pattern matches "abcabc"	or "defabc":

       /(?|(abc)|(def))(?1)/

       If  a  condition	 test for a subpattern having matched refers to	a non-
       unique number, the test is true if any of the subpatterns of that  num-
       ber have	matched.

       An alternative approach using this "branch reset" feature is to use du-
       plicate named subpatterns, as described in the next section.

NAMED SUBPATTERNS
       Identifying  capturing  parentheses  by number is simple, but it	can be
       hard to keep track of the numbers in complicated	 regular  expressions.
       Also,  if  an  expression  is modified, the numbers can change. To help
       with this difficulty, PCRE supports the	naming	of  subpatterns.  This
       feature	was  not added to Perl until release 5.10. Python had the fea-
       ture earlier, and PCRE introduced it at release 4.0, using  the	Python
       syntax. PCRE now	supports both the Perl and the Python syntax. Perl al-
       lows identically	numbered subpatterns to	have different names, but PCRE
       does not.

       In  PCRE,  a subpattern can be named in one of three ways: (?<name>...)
       or (?'name'...) as in Perl, or (?P<name>...) as in  Python.  References
       to  capturing parentheses from other parts of the pattern, such as back
       references, recursion, and conditions, can be made by name and by  num-
       ber.

       Names  consist of up to 32 alphanumeric characters and underscores, but
       must start with a non-digit. Named capturing parentheses	are still  al-
       located	numbers	 as  well  as  names, exactly as if the	names were not
       present.	The capture specification to run/3 can	use  named  values  if
       they are	present	in the regular expression.

       By default, a name must be unique within	a pattern, but this constraint
       can  be	relaxed	by setting option dupnames at compile time. (Duplicate
       names are also always permitted for subpatterns with the	 same  number,
       set  up	as  described in the previous section.)	Duplicate names	can be
       useful for patterns where only one instance of  the  named  parentheses
       can match. Suppose that you want	to match the name of a weekday,	either
       as  a  3-letter abbreviation or as the full name, and in	both cases you
       want to extract the abbreviation. The following pattern	(ignoring  the
       line breaks) does the job:

       (?<DN>Mon|Fri|Sun)(?:day)?|
       (?<DN>Tue)(?:sday)?|
       (?<DN>Wed)(?:nesday)?|
       (?<DN>Thu)(?:rsday)?|
       (?<DN>Sat)(?:urday)?

       There  are  five	capturing substrings, but only one is ever set after a
       match. (An alternative way of solving this problem is to	use a  "branch
       reset" subpattern, as described in the previous section.)

       For  capturing  named subpatterns which names are not unique, the first
       matching	occurrence (counted from left to right in the subject) is  re-
       turned  from  run/3, if the name	is specified in	the values part	of the
       capture statement. The all_names	capturing value	matches	all the	 names
       in the same way.

   Note:
       You  cannot  use	different names	to distinguish between two subpatterns
       with the	same number, as	PCRE uses only the numbers when	matching.  For
       this  reason,  an error is given	at compile time	if different names are
       specified to subpatterns	with the same number. However, you can specify
       the same	name to	subpatterns with the same number, even	when  dupnames
       is not set.

REPETITION
       Repetition  is  specified  by  quantifiers, which can follow any	of the
       following items:

	 * A literal data character

	 * The dot metacharacter

	 * The \C escape sequence

	 * The \X escape sequence

	 * The \R escape sequence

	 * An escape such as \d	or \pL that matches a single character

	 * A character class

	 * A back reference (see the next section)

	 * A parenthesized subpattern (including assertions)

	 * A subroutine	call to	a subpattern (recursive	or otherwise)

       The general repetition quantifier specifies a minimum and maximum  num-
       ber  of	permitted matches, by giving the two numbers in	curly brackets
       (braces), separated by a	comma. The numbers must	be <  65536,  and  the
       first  must  be less than or equal to the second. For example, the fol-
       lowing matches "zz", "zzz", or "zzzz":

       z{2,4}

       A closing brace on its own is not a special character.  If  the	second
       number  is  omitted, but	the comma is present, there is no upper	limit.
       If the second number and	the comma are  both  omitted,  the  quantifier
       specifies  an  exact  number  of	 required matches. Thus, the following
       matches at least	three successive vowels, but can match many more:

       [aeiou]{3,}

       The following matches exactly eight digits:

       \d{8}

       An opening curly	bracket	that appears in	a position where a  quantifier
       is  not allowed,	or one that does not match the syntax of a quantifier,
       is taken	as a literal character.	For example, {,6} is not a quantifier,
       but a literal string of four characters.

       In Unicode mode,	quantifiers apply to characters	rather than  to	 indi-
       vidual  data  units.  Thus, for example,	\x{100}{2} matches two charac-
       ters, each of which is represented by a	2-byte	sequence  in  a	 UTF-8
       string.	Similarly, \X{3} matches three Unicode extended	grapheme clus-
       ters, each of which can be many data units long (and  they  can	be  of
       different lengths).

       The quantifier {0} is permitted,	causing	the expression to behave as if
       the previous item and the quantifier were not present. This can be use-
       ful  for	 subpatterns that are referenced as subroutines	from elsewhere
       in the pattern (but see also section  Defining Subpatterns for  Use  by
       Reference  Only).  Items	other than subpatterns that have a {0} quanti-
       fier are	omitted	from the compiled pattern.

       For convenience,	the three most common quantifiers have	single-charac-
       ter abbreviations:

	 *:
	   Equivalent to {0,}

	 +:
	   Equivalent to {1,}

	 ?:
	   Equivalent to {0,1}

       Infinite	 loops	can  be	constructed by following a subpattern that can
       match no	characters with	a quantifier that has no upper limit, for  ex-
       ample:

       (a?)*

       Earlier versions	of Perl	and PCRE used to give an error at compile time
       for  such  patterns. However, as	there are cases	where this can be use-
       ful, such patterns are now accepted. However, if	any repetition of  the
       subpattern matches no characters, the loop is forcibly broken.

       By  default,  the quantifiers are "greedy", that	is, they match as much
       as possible (up to the maximum  number  of  permitted  times),  without
       causing	the  remaining	pattern	 to fail. The classic example of where
       this gives problems is in trying	to match comments in C programs. These
       appear between /* and */. Within	the comment, individual	* and /	 char-
       acters  can appear. An attempt to match C comments by applying the pat-
       tern

       /\*.*\*/

       to the string

       /* first	comment	*/  not	comment	 /* second comment */

       fails, as it matches the	entire string owing to the greediness  of  the
       .* item.

       However,	 if  a quantifier is followed by a question mark, it ceases to
       be greedy, and instead matches the minimum number of times possible, so
       the following pattern does the right thing with the C comments:

       /\*.*?\*/

       The meaning of the various quantifiers is not otherwise	changed,  only
       the  preferred  number  of matches. Do not confuse this use of question
       mark with its use as a quantifier in its	own right. As it has two uses,
       it can sometimes	appear doubled,	as in

       \d??\d

       which matches one digit by preference, but can match two	if that	is the
       only way	the remaining pattern matches.

       If option ungreedy is set (an option that is not	 available  in	Perl),
       the  quantifiers	 are not greedy	by default, but	individual ones	can be
       made greedy by following	them with a question mark. That	is, it inverts
       the default behavior.

       When a parenthesized subpattern is quantified  with  a  minimum	repeat
       count  that  is	> 1 or with a limited maximum, more memory is required
       for the compiled	pattern, in proportion to the size of the  minimum  or
       maximum.

       If  a  pattern starts with .* or	.{0,} and option dotall	(equivalent to
       Perl option /s) is set, thus allowing the dot to	 match	newlines,  the
       pattern	is  implicitly	anchored,  because  whatever  follows is tried
       against every character position	in the subject string. So, there is no
       point in	retrying the overall match at any position  after  the	first.
       PCRE normally treats such a pattern as if it was	preceded by \A.

       In  cases  where	 it  is	known that the subject string contains no new-
       lines, it is worth setting dotall to obtain this	optimization,  or  al-
       ternatively using ^ to indicate anchoring explicitly.

       However,	 there	are  some cases	where the optimization cannot be used.
       When .* is inside capturing parentheses that are	the subject of a  back
       reference elsewhere in the pattern, a match at the start	can fail where
       a later one succeeds. Consider, for example:

       (.*)abc\1

       If the subject is "xyz123abc123", the match point is the	fourth charac-
       ter. Therefore, such a pattern is not implicitly	anchored.

       Another	case where implicit anchoring is not applied is	when the lead-
       ing .* is inside	an atomic group. Once again, a match at	the start  can
       fail where a later one succeeds.	Consider the following pattern:

       (?>.*?a)b

       It  matches "ab"	in the subject "aab". The use of the backtracking con-
       trol verbs (*PRUNE) and (*SKIP) also disable this optimization.

       When a capturing	subpattern is repeated,	the value captured is the sub-
       string that matched the final iteration.	For example, after

       (tweedle[dume]{3}\s*)+

       has matched "tweedledum tweedledee", the	value  of  the	captured  sub-
       string  is "tweedledee".	However, if there are nested capturing subpat-
       terns, the corresponding	captured values	can have been set in  previous
       iterations. For example,	after

       /(a|(b))+/

       matches "aba", the value	of the second captured substring is "b".

ATOMIC GROUPING	AND POSSESSIVE QUANTIFIERS
       With  both  maximizing ("greedy") and minimizing	("ungreedy" or "lazy")
       repetition, failure of what follows normally causes the	repeated  item
       to  be  re-evaluated to see if a	different number of repeats allows the
       remaining pattern to match. Sometimes it	is useful to prevent this, ei-
       ther to change the nature of the	match, or to cause it to fail  earlier
       than  it	 otherwise  might,  when  the author of	the pattern knows that
       there is	no point in carrying on.

       Consider, for example, the pattern \d+foo when applied to the following
       subject line:

       123456bar

       After matching all six digits and then failing to match "foo", the nor-
       mal action of the matcher is to try again with only five	digits	match-
       ing item	\d+, and then with four, and so	on, before ultimately failing.
       "Atomic	grouping"  (a  term taken from Jeffrey Friedl's	book) provides
       the means for specifying	that once a subpattern has matched, it is  not
       to be re-evaluated in this way.

       If  atomic grouping is used for the previous example, the matcher gives
       up immediately on failing to match "foo"	the first time.	 The  notation
       is a kind of special parenthesis, starting with (?> as in the following
       example:

       (?>\d+)foo

       This kind of parenthesis	"locks up" the part of the pattern it contains
       once  it	 has  matched,	and a failure further into the pattern is pre-
       vented from backtracking	into it.  Backtracking	past  it  to  previous
       items, however, works as	normal.

       An  alternative	description  is	that a subpattern of this type matches
       the string of characters	that an	 identical  standalone	pattern	 would
       match, if anchored at the current point in the subject string.

       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
       such as the above example can be	thought	of as a	maximizing repeat that
       must  swallow  everything  it can. So, while both \d+ and \d+? are pre-
       pared to	adjust the number of digits they match to make	the  remaining
       pattern match, (?>\d+) can only match an	entire sequence	of digits.

       Atomic  groups  in general can contain any complicated subpatterns, and
       can be nested. However, when the	subpattern for an atomic group is just
       a single	repeated item, as in the example above,	 a  simpler  notation,
       called a	"possessive quantifier"	can be used. This consists of an extra
       +  character  following a quantifier. Using this	notation, the previous
       example can be rewritten	as

       \d++foo

       Notice that a possessive	quantifier can be used with an	entire	group,
       for example:

       (abc|xyz){2,3}+

       Possessive  quantifiers	are  always  greedy; the setting of option un-
       greedy is ignored. They are a convenient	notation for the simpler forms
       of an atomic group. However, there is no	difference in the meaning of a
       possessive quantifier and the equivalent	atomic group, but there	can be
       a performance difference; possessive quantifiers	are probably  slightly
       faster.

       The  possessive	quantifier syntax is an	extension to the Perl 5.8 syn-
       tax. Jeffrey Friedl originated the idea (and the	 name)	in  the	 first
       edition of his book. Mike McCloskey liked it, so	implemented it when he
       built  the  Sun	Java  package, and PCRE	copied it from there. It ulti-
       mately found its	way into Perl at release 5.10.

       PCRE has	an optimization	that automatically "possessifies" certain sim-
       ple pattern constructs. For example, the	sequence  A+B  is  treated  as
       A++B,  as there is no point in backtracking into	a sequence of A:s when
       B must follow.

       When a pattern contains an unlimited repeat inside  a  subpattern  that
       can  itself  be	repeated  an  unlimited	number of times, the use of an
       atomic group is the only	way to avoid some  failing  matches  taking  a
       long time. The pattern

       (\D+|<\d+>)*[!?]

       matches	an  unlimited number of	substrings that	either consist of non-
       digits, or digits enclosed in <>, followed by ! or ?. When it  matches,
       it runs quickly.	However, if it is applied to

       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

       it  takes  a  long  time	 before	reporting failure. This	is because the
       string can be divided between the internal \D+ repeat and the  external
       *  repeat  in  many ways, and all must be tried.	(The example uses [!?]
       rather than a single character at the end, as both PCRE and  Perl  have
       an optimization that allows for fast failure when a single character is
       used.  They  remember  the last single character	that is	required for a
       match, and fail early if	it is not present in the string.) If the  pat-
       tern  is	 changed  so that it uses an atomic group, like	the following,
       sequences of non-digits cannot be broken, and failure happens quickly:

       ((?>\D+)|<\d+>)*[!?]

BACK REFERENCES
       Outside a character class, a backslash followed by a  digit  >  0  (and
       possibly	 further digits) is a back reference to	a capturing subpattern
       earlier (that is, to its	left) in the pattern, provided there have been
       that many previous capturing left parentheses.

       However,	if the decimal number following	the backslash is < 10,	it  is
       always taken as a back reference, and causes an error only if there are
       not  that  many	capturing left parentheses in the entire pattern. That
       is, the parentheses that	are referenced do need not be to the  left  of
       the reference for numbers < 10. A "forward back reference" of this type
       can  make sense when a repetition is involved and the subpattern	to the
       right has participated in an earlier iteration.

       It is not possible to have a numerical "forward back  reference"	 to  a
       subpattern  whose number	is 10 or more using this syntax, as a sequence
       such as \50 is interpreted as a character defined in  octal.  For  more
       details	of  the	 handling of digits following a	backslash, see section
       Non-Printing Characters earlier.	There is no such  problem  when	 named
       parentheses  are	 used.	A back reference to any	subpattern is possible
       using named parentheses (see below).

       Another way to avoid the	ambiguity inherent in the use of  digits  fol-
       lowing  a  backslash is to use the \g escape sequence. This escape must
       be followed by an unsigned number or a negative number, optionally  en-
       closed in braces. The following examples	are identical:

       (ring), \1
       (ring), \g1
       (ring), \g{1}

       An  unsigned number specifies an	absolute reference without the ambigu-
       ity that	is present in the older	syntax.	It is also useful when literal
       digits follow the reference. A negative number is a relative reference.
       Consider	the following example:

       (abc(def)ghi)\g{-1}

       The sequence \g{-1} is a	reference to the most recently started captur-
       ing subpattern before \g, that is, it is	equivalent to \2 in this exam-
       ple. Similarly, \g{-2} would be equivalent to \1. The use  of  relative
       references  can	be helpful in long patterns, and also in patterns that
       are created by joining fragments	 containing  references	 within	 them-
       selves.

       A  back	reference matches whatever matched the capturing subpattern in
       the current subject string, rather than anything	matching  the  subpat-
       tern itself (section Subpattern as Subroutines describes	a way of doing
       that).  So,  the	 following pattern matches "sense and sensibility" and
       "response and responsibility", but not "sense and responsibility":

       (sens|respons)e and \1ibility

       If caseful matching is in force at the time of the back reference,  the
       case  of	 letters  is relevant. For example, the	following matches "rah
       rah" and	"RAH RAH", but not "RAH	rah", although the original  capturing
       subpattern is matched caselessly:

       ((?i)rah)\s+\1

       There  are many different ways of writing back references to named sub-
       patterns. The .NET syntax \k{name} and  the  Perl  syntax  \k<name>  or
       \k'name'	 are supported,	as is the Python syntax	(?P=name). The unified
       back reference syntax in	Perl 5.10, in which \g can be  used  for  both
       numeric	and  named references, is also supported. The previous example
       can be rewritten	in the following ways:

       (?<p1>(?i)rah)\s+\k<p1>
       (?'p1'(?i)rah)\s+\k{p1}
       (?P<p1>(?i)rah)\s+(?P=p1)
       (?<p1>(?i)rah)\s+\g{p1}

       A subpattern that is referenced by name can appear in the  pattern  be-
       fore or after the reference.

       There  can be more than one back	reference to the same subpattern. If a
       subpattern has not been used in a particular match, any back references
       to it always fails. For example,	the following pattern always fails  if
       it starts to match "a" rather than "bc":

       (a|(bc))\2

       As  there  can  be  many	capturing parentheses in a pattern, all	digits
       following the backslash are taken as part of a potential	back reference
       number. If the pattern continues	with a digit character,	some delimiter
       must be used to terminate the back reference.  If  option  extended  is
       set,  this  can	be whitespace. Otherwise an empty comment (see section
       Comments) can be	used.

       Recursive Back References

       A back reference	that occurs inside the parentheses to which it	refers
       fails  when  the	subpattern is first used, so, for example, (a\1) never
       matches.	However, such references can be	useful inside repeated subpat-
       terns. For example, the following pattern matches any  number  of  "a"s
       and also	"aba", "ababbaa", and so on:

       (a|b\1)+

       At  each	 iteration  of	the subpattern,	the back reference matches the
       character string	corresponding to the previous iteration. In order  for
       this  to	 work,	the pattern must be such that the first	iteration does
       not need	to match the back reference. This can be done  using  alterna-
       tion,  as  in  the  example above, or by	a quantifier with a minimum of
       zero.

       Back references of this type cause the group that they reference	to  be
       treated	as  an	atomic group. Once the whole group has been matched, a
       subsequent matching failure cannot cause	backtracking into  the	middle
       of the group.

ASSERTIONS
       An  assertion  is  a  test on the characters following or preceding the
       current matching	point that does	not consume any	characters. The	simple
       assertions coded	as \b, \B, \A, \G, \Z, \z, ^, and $ are	 described  in
       the previous sections.

       More  complicated  assertions  are  coded as subpatterns. There are two
       kinds: those that look ahead of the current  position  in  the  subject
       string,	and  those  that  look	behind	it. An assertion subpattern is
       matched in the normal way, except that it does not  cause  the  current
       matching	position to be changed.

       Assertion  subpatterns are not capturing	subpatterns. If	such an	asser-
       tion contains capturing subpatterns within it, these  are  counted  for
       the  purposes  of numbering the capturing subpatterns in	the whole pat-
       tern. However, substring	capturing is done  only	 for  positive	asser-
       tions.  (Perl sometimes,	but not	always,	performs capturing in negative
       assertions.)

   Warning:
       If a positive assertion containing one or  more	capturing  subpatterns
       succeeds, but failure to	match later in the pattern causes backtracking
       over  this  assertion, the captures within the assertion	are reset only
       if no higher numbered captures are already set. This is,	unfortunately,
       a fundamental limitation	of the current implementation, and as PCRE1 is
       now in maintenance-only status, it is unlikely ever to change.

       For compatibility with Perl, assertion  subpatterns  can	 be  repeated.
       However,	 it  makes  no	sense to assert	the same thing many times, the
       side effect of capturing	parentheses can	 occasionally  be  useful.  In
       practice, there are only	three cases:

	 * If  the  quantifier	is  {0},  the assertion	is never obeyed	during
	   matching. However, it can contain internal capturing	 parenthesized
	   groups that are called from elsewhere through the subroutine	mecha-
	   nism.

	 * If  quantifier  is  {0,n},  where n > 0, it is treated as if	it was
	   {0,1}. At runtime, the remaining pattern match is  tried  with  and
	   without  the	 assertion, the	order depends on the greediness	of the
	   quantifier.

	 * If the minimum repetition is	> 0, the quantifier  is	 ignored.  The
	   assertion is	obeyed only once when encountered during matching.

       Lookahead Assertions

       Lookahead assertions start with (?= for positive	assertions and (?! for
       negative	assertions. For	example, the following matches a word followed
       by a semicolon, but does	not include the	semicolon in the match:

       \w+(?=;)

       The  following  matches any occurrence of "foo" that is not followed by
       "bar":

       foo(?!bar)

       Notice that the apparently similar pattern

       (?!foo)bar

       does not	find an	occurrence of "bar"  that  is  preceded	 by  something
       other  than  "foo". It finds any	occurrence of "bar" whatsoever,	as the
       assertion (?!foo) is always true	when the  next	three  characters  are
       "bar". A	lookbehind assertion is	needed to achieve the other effect.

       If you want to force a matching failure at some point in	a pattern, the
       most  convenient	 way  to do it is with (?!), as	an empty string	always
       matches.	So, an assertion that requires there is	not  to	 be  an	 empty
       string  must always fail. The backtracking control verb (*FAIL) or (*F)
       is a synonym for	(?!).

       Lookbehind Assertions

       Lookbehind assertions start with	(?<= for positive assertions and  (?<!
       for negative assertions.	For example, the following finds an occurrence
       of "bar"	that is	not preceded by	"foo":

       (?<!foo)bar

       The contents of a lookbehind assertion are restricted such that all the
       strings it matches must have a fixed length. However, if	there are many
       top-level  alternatives,	 they  do  not all have	to have	the same fixed
       length. Thus, the following is permitted:

       (?<=bullock|donkey)

       The following causes an error at	compile	time:

       (?<!dogs?|cats?)

       Branches	that match different length strings are	permitted only at  the
       top-level of a lookbehind assertion. This is an extension compared with
       Perl,  which  requires all branches to match the	same length of string.
       An assertion such as the	following is not permitted, as its single top-
       level branch can	match two different lengths:

       (?<=ab(c|de))

       However,	it is acceptable to PCRE if rewritten  to  use	two  top-level
       branches:

       (?<=abc|abde)

       Sometimes  the  escape sequence \K (see above) can be used instead of a
       lookbehind assertion to get round the fixed-length restriction.

       The implementation of lookbehind	assertions is, for  each  alternative,
       to  move	 the current position back temporarily by the fixed length and
       then try	to match. If there are insufficient characters before the cur-
       rent position, the assertion fails.

       In a UTF	mode, PCRE does	not allow the \C escape	(which matches a  sin-
       gle  data  unit even in a UTF mode) to appear in	lookbehind assertions,
       as it makes it impossible to calculate the length  of  the  lookbehind.
       The \X and \R escapes, which can	match different	numbers	of data	units,
       are not permitted either.

       "Subroutine" calls (see below), such as (?2) or (?&X), are permitted in
       lookbehinds,  as	 long as the subpattern	matches	a fixed-length string.
       Recursion, however, is not supported.

       Possessive quantifiers can be used with lookbehind assertions to	 spec-
       ify  efficient  matching	 of fixed-length strings at the	end of subject
       strings.	Consider the following simple pattern when applied to  a  long
       string that does	not match:

       abcd$

       As matching proceeds from left to right,	PCRE looks for each "a"	in the
       subject and then	sees if	what follows matches the remaining pattern. If
       the pattern is specified	as

       ^.*abcd$

       the  initial  .*	matches	the entire string at first. However, when this
       fails (as there is no following "a"), it	backtracks to  match  all  but
       the  last  character,  then all but the last two	characters, and	so on.
       Once again the search for "a" covers the	entire string, from  right  to
       left, so	we are no better off. However, if the pattern is written as

       ^.*+(?<=abcd)

       there  can  be  no backtracking for the .*+ item; it can	match only the
       entire string. The subsequent lookbehind	assertion does a  single  test
       on  the last four characters. If	it fails, the match fails immediately.
       For long	strings, this approach makes a significant difference  to  the
       processing time.

       Using Multiple Assertions

       Many assertions (of any sort) can occur in succession. For example, the
       following matches "foo" preceded	by three digits	that are not "999":

       (?<=\d{3})(?<!999)foo

       Notice that each	of the assertions is applied independently at the same
       point  in  the subject string. First there is a check that the previous
       three characters	are all	digits,	and then there is  a  check  that  the
       same  three characters are not "999". This pattern does not match "foo"
       preceded	by six characters, the first of	which are digits and the  last
       three  of  which	are not	"999". For example, it does not	match "123abc-
       foo". A pattern to do that is the following:

       (?<=\d{3}...)(?<!999)foo

       This time the first assertion looks at the  preceding  six  characters,
       checks  that  the first three are digits, and then the second assertion
       checks that the preceding three characters are not "999".

       Assertions can be nested	in any combination. For	example, the following
       matches an occurrence of	"baz" that is preceded by "bar", which in turn
       is not preceded by "foo":

       (?<=(?<!foo)bar)baz

       The following pattern matches "foo" preceded by three  digits  and  any
       three characters	that are not "999":

       (?<=\d{3}(?!999)...)foo

CONDITIONAL SUBPATTERNS
       It  is possible to cause	the matching process to	obey a subpattern con-
       ditionally or to	choose between two alternative subpatterns,  depending
       on  the result of an assertion, or whether a specific capturing subpat-
       tern has	already	been matched. The following are	the two	possible forms
       of conditional subpattern:

       (?(condition)yes-pattern)
       (?(condition)yes-pattern|no-pattern)

       If the condition	is satisfied, the yes-pattern is used,	otherwise  the
       no-pattern  (if	present).  If  more than two alternatives exist	in the
       subpattern, a compile-time error	occurs.	Each of	the  two  alternatives
       can  itself  contain  nested  subpatterns of any	form, including	condi-
       tional subpatterns; the restriction to two alternatives applies only at
       the level of the	condition. The following pattern fragment is an	 exam-
       ple where the alternatives are complex:

       (?(1) (A|B|C) | (D | (?(2)E|F) |	E) )

       There  are  four	 kinds of condition: references	to subpatterns,	refer-
       ences to	recursion, a pseudo-condition called DEFINE, and assertions.

       Checking	for a Used Subpattern By Number

       If the text between the parentheses consists of a sequence  of  digits,
       the condition is	true if	a capturing subpattern of that number has pre-
       viously	matched.  If  more than	one capturing subpattern with the same
       number exists (see section  Duplicate Subpattern	Numbers	earlier),  the
       condition  is true if any of them have matched. An alternative notation
       is to precede the digits	with a plus or minus sign. In this  case,  the
       subpattern  number  is relative rather than absolute. The most recently
       opened parentheses can be referenced by (?(-1), the next	most recent by
       (?(-2), and so on. Inside loops,	it can also make  sense	 to  refer  to
       subsequent  groups. The next parentheses	to be opened can be referenced
       as (?(+1), and so on. (The value	zero in	any  of	 these	forms  is  not
       used; it	provokes a compile-time	error.)

       Consider	 the  following	pattern, which contains	non-significant	white-
       space to	make it	more readable (assume option extended) and  to	divide
       it into three parts for ease of discussion:

       ( \( )?	  [^()]+    (?(1) \) )

       The  first  part	 matches  an optional opening parenthesis, and if that
       character is present, sets it as	the first captured substring. The sec-
       ond part	matches	one or more characters that are	not  parentheses.  The
       third part is a conditional subpattern that tests whether the first set
       of parentheses matched or not. If they did, that	is, if subject started
       with an opening parenthesis, the	condition is true, and so the yes-pat-
       tern  is	 executed and a	closing	parenthesis is required. Otherwise, as
       no-pattern is not present, the subpattern  matches  nothing.  That  is,
       this pattern matches a sequence of non-parentheses, optionally enclosed
       in parentheses.

       If  this	 pattern is embedded in	a larger one, a	relative reference can
       be used:

       This makes the fragment independent of the parentheses  in  the	larger
       pattern.

       Checking	for a Used Subpattern By Name

       Perl  uses  the	syntax	(?(<name>)...) or (?('name')...) to test for a
       used subpattern by name.	For compatibility  with	 earlier  versions  of
       PCRE,  which  had this facility before Perl, the	syntax (?(name)...) is
       also recognized.

       Rewriting the previous example to use a named subpattern	gives:

       (?<OPEN>	\( )?	 [^()]+	   (?(<OPEN>) \) )

       If the name used	in a condition of this kind is a duplicate,  the  test
       is  applied to all subpatterns of the same name,	and is true if any one
       of them has matched.

       Checking	for Pattern Recursion

       If the condition	is the string (R), and there is	no subpattern with the
       name R, the condition is	true if	a recursive call to the	whole  pattern
       or any subpattern has been made.	If digits or a name preceded by	amper-
       sand follow the letter R, for example:

       (?(R3)...) or (?(R&name)...)

       the condition is	true if	the most recent	recursion is into a subpattern
       whose number or name is given. This condition does not check the	entire
       recursion  stack. If the	name used in a condition of this kind is a du-
       plicate,	the test is applied to all subpatterns of the same  name,  and
       is true if any one of them is the most recent recursion.

       At "top-level", all these recursion test	conditions are false. The syn-
       tax for recursive patterns is described below.

       Defining	Subpatterns for	Use By Reference Only

       If  the	condition  is  the string (DEFINE), and	there is no subpattern
       with the	name DEFINE, the condition is  always  false.  In  this	 case,
       there  can  be  only  one  alternative  in the subpattern. It is	always
       skipped if control reaches this point in	the pattern. The idea  of  DE-
       FINE  is	that it	can be used to define "subroutines" that can be	refer-
       enced from elsewhere. (The use of subroutines is	described below.)  For
       example,	 a pattern to match an IPv4 address, such as "192.168.23.245",
       can be written like this	(ignore	whitespace and line breaks):

       (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] |	1\d\d |	[1-9]?\d) ) \b (?&byte)	(\.(?&byte)){3}	\b

       The first part of the pattern is	a DEFINE group inside which is	a  an-
       other  group named "byte" is defined. This matches an individual	compo-
       nent of an IPv4 address (a number < 256). When  matching	 takes	place,
       this part of the	pattern	is skipped, as DEFINE acts like	a false	condi-
       tion. The remaining pattern uses	references to the named	group to match
       the  four  dot-separated	 components of an IPv4 address,	insisting on a
       word boundary at	each end.

       Assertion Conditions

       If the condition	is not in any of the above formats, it must be an  as-
       sertion.	This can be a positive or negative lookahead or	lookbehind as-
       sertion.	 Consider  the	following  pattern, containing non-significant
       whitespace, and with the	two alternatives on the	second line:

       (?(?=[^a-z]*[a-z])
       \d{2}-[a-z]{3}-\d{2}  |	\d{2}-\d{2}-\d{2} )

       The condition is	a positive lookahead assertion	that  matches  an  op-
       tional  sequence	of non-letters followed	by a letter. That is, it tests
       for the presence	of at least one	letter in the subject. If a letter  is
       found,  the subject is matched against the first	alternative, otherwise
       it is matched against the second. This pattern matches strings  in  one
       of  the	two  forms dd-aaa-dd or	dd-dd-dd, where	aaa are	letters	and dd
       are digits.

COMMENTS
       There are two ways to include comments in patterns that	are  processed
       by PCRE.	In both	cases, the start of the	comment	must not be in a char-
       acter  class, or	in the middle of any other sequence of related charac-
       ters such as (?:	or a subpattern	name or	number.	 The  characters  that
       make up a comment play no part in the pattern matching.

       The  sequence (?# marks the start of a comment that continues up	to the
       next closing parenthesis. Nested	parentheses are	not permitted. If  op-
       tion  PCRE_EXTENDED  is set, an unescaped # character also introduces a
       comment,	which in this case continues to	 immediately  after  the  next
       newline	character  or character	sequence in the	pattern. Which charac-
       ters are	interpreted as newlines	is controlled by the options passed to
       a compiling function or by a special sequence at	the start of the  pat-
       tern, as	described in section  Newline Conventions earlier.

       Notice  that  the  end of this type of comment is a literal newline se-
       quence in the pattern; escape sequences that happen to represent	a new-
       line do not count. For example, consider	the following pattern when ex-
       tended is set, and the default newline convention is in force:

       abc #comment \n still comment

       On encountering character #, pcre_compile() skips along,	looking	for  a
       newline in the pattern. The sequence \n is still	literal	at this	stage,
       so  it does not terminate the comment. Only a character with code value
       0x0a (the default newline) does so.

RECURSIVE PATTERNS
       Consider	the problem of matching	a string in parentheses, allowing  for
       unlimited  nested  parentheses.	Without	the use	of recursion, the best
       that can	be done	is to use a pattern that  matches  up  to  some	 fixed
       depth  of  nesting.  It	is not possible	to handle an arbitrary nesting
       depth.

       For some	time, Perl has provided	a facility that	allows regular expres-
       sions to	recurse	(among other things). It does  this  by	 interpolating
       Perl  code  in the expression at	runtime, and the code can refer	to the
       expression itself. A Perl pattern using code interpolation to solve the
       parentheses problem can be created like this:

       $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

       Item (?p{...}) interpolates Perl	code at	 runtime,  and	in  this  case
       refers recursively to the pattern in which it appears.

       Obviously, PCRE cannot support the interpolation	of Perl	code. Instead,
       it supports special syntax for recursion	of the entire pattern, and for
       individual  subpattern  recursion.  After  its introduction in PCRE and
       Python, this kind of recursion was later	introduced into	 Perl  at  re-
       lease 5.10.

       A special item that consists of (? followed by a	number > 0 and a clos-
       ing parenthesis is a recursive subroutine call of the subpattern	of the
       given  number,  if  it  occurs inside that subpattern. (If not, it is a
       non-recursive subroutine	call, which is described in the	next section.)
       The special item	(?R) or	(?0) is	a recursive call of the	entire regular
       expression.

       This PCRE pattern solves	the nested parentheses	problem	 (assume  that
       option extended is set so that whitespace is ignored):

       \( ( [^()]++ | (?R) )* \)

       First  it matches an opening parenthesis. Then it matches any number of
       substrings, which can either be a sequence of non-parentheses or	a  re-
       cursive match of	the pattern itself (that is, a correctly parenthesized
       substring). Finally there is a closing parenthesis. Notice the use of a
       possessive  quantifier  to  avoid  backtracking	into sequences of non-
       parentheses.

       If this was part	of a larger pattern, you would not want	to recurse the
       entire pattern, so instead you can use:

       ( \( ( [^()]++ |	(?1) )*	\) )

       The pattern is here within parentheses so that the recursion refers  to
       them instead of the whole pattern.

       In  a  larger  pattern,	keeping	 track	of  parenthesis	numbers	can be
       tricky. This is made easier by the use of relative references.  Instead
       of  (?1)	in the pattern above, you can write (?-2) to refer to the sec-
       ond most	recently opened	parentheses preceding the recursion. That  is,
       a negative number counts	capturing parentheses leftwards	from the point
       at which	it is encountered.

       It  is  also  possible to refer to later	opened parentheses, by writing
       references such as (?+2). However, these	cannot be  recursive,  as  the
       reference  is  not inside the parentheses that are referenced. They are
       always non-recursive subroutine calls, as described in  the  next  sec-
       tion.

       An  alternative	approach is to use named parentheses instead. The Perl
       syntax for this is (?&name). The	earlier	PCRE syntax (?P>name) is  also
       supported. We can rewrite the above example as follows:

       (?<pn> \( ( [^()]++ | (?&pn) )* \) )

       If  there  is more than one subpattern with the same name, the earliest
       one is used.

       This particular example pattern that we have  studied  contains	nested
       unlimited repeats, and so the use of a possessive quantifier for	match-
       ing  strings  of	non-parentheses	is important when applying the pattern
       to strings that do not match. For example, when this pattern is applied
       to

       (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

       it gives	"no match" quickly. However, if	a possessive quantifier	is not
       used, the match runs for	a long time, as	there are  so  many  different
       ways  the  +  and  *  repeats can carve up the subject, and all must be
       tested before failure can be reported.

       At the end of a match, the values of capturing  parentheses  are	 those
       from the	outermost level. If the	pattern	above is matched against

       (ab(cd)ef)

       the  value  for	the  inner capturing parentheses (numbered 2) is "ef",
       which is	the last value taken on	at the top-level. If a capturing  sub-
       pattern	is  not	 matched at the	top level, its final captured value is
       unset, even if it was (temporarily) set at a deeper  level  during  the
       matching	process.

       Do not confuse item (?R)	with condition (R), which tests	for recursion.
       Consider	 the  following	pattern, which matches text in angle brackets,
       allowing	for arbitrary nesting.	Only  digits  are  allowed  in	nested
       brackets	 (that is, when	recursing), while any characters are permitted
       at the outer level.

       < (?: (?(R) \d++	 | [^<>]*+) | (?R)) * >

       Here (?(R) is the start of a conditional	subpattern, with two different
       alternatives for	the recursive and non-recursive	cases.	Item  (?R)  is
       the actual recursive call.

       Differences in Recursion	Processing between PCRE	and Perl

       Recursion  processing  in PCRE differs from Perl	in two important ways.
       In PCRE (like Python, but unlike	Perl), a recursive subpattern call  is
       always treated as an atomic group. That is, once	it has matched some of
       the subject string, it is never re-entered, even	if it contains untried
       alternatives  and  there	 is a subsequent matching failure. This	can be
       illustrated by the following pattern, which means  to  match  a	palin-
       dromic string containing	an odd number of characters (for example, "a",
       "aba", "abcba", "abcdcba"):

       ^(.|(.)(?1)\2)$

       The idea	is that	it either matches a single character, or two identical
       characters surrounding a	subpalindrome. In Perl,	this pattern works; in
       PCRE  it	 does not work if the pattern is longer	than three characters.
       Consider	the subject string "abcba".

       At the top level, the first character is	matched, but as	it is  not  at
       the end of the string, the first	alternative fails, the second alterna-
       tive  is	 taken,	and the	recursion kicks	in. The	recursive call to sub-
       pattern 1 successfully matches the next character ("b").	 (Notice  that
       the beginning and end of	line tests are not part	of the recursion.)

       Back  at	 the top level,	the next character ("c") is compared with what
       subpattern 2 matched, which was "a". This fails.	As  the	 recursion  is
       treated	as  an atomic group, there are now no backtracking points, and
       so the entire match fails. (Perl	can now	re-enter the recursion and try
       the second alternative.)	However, if the	pattern	is  written  with  the
       alternatives in the other order,	things are different:

       ^((.)(?1)\2|.)$

       This  time,  the	recursing alternative is tried first, and continues to
       recurse until it	runs out of characters,	at which point	the  recursion
       fails.  But  this time we have another alternative to try at the	higher
       level. That is the significant difference: in the previous case the re-
       maining alternative is at a deeper recursion level, which  PCRE	cannot
       use.

       To  change  the pattern so that it matches all palindromic strings, not
       only those with an odd number of	characters, it is tempting  to	change
       the pattern to this:

       ^((.)(?1)\2|.?)$

       Again,  this  works  in Perl, but not in	PCRE, and for the same reason.
       When a deeper recursion has matched a single character,	it  cannot  be
       entered again to	match an empty string. The solution is to separate the
       two  cases, and write out the odd and even cases	as alternatives	at the
       higher level:

       ^(?:((.)(?1)\2|)|((.)(?3)\4|.))

       If you want to match typical palindromic	phrases, the pattern must  ig-
       nore all	non-word characters, which can be done as follows:

       ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$

       If  run	with  option caseless, this pattern matches phrases such as "A
       man, a plan, a canal: Panama!" and it works well	in both	PCRE and Perl.
       Notice the use of the possessive	quantifier *+  to  avoid  backtracking
       into  sequences	of  non-word characters. Without this, PCRE takes much
       longer (10 times	or more) to match typical phrases, and Perl  takes  so
       long that you think it has gone into a loop.

   Note:
       The  palindrome-matching	patterns above work only if the	subject	string
       does not	start with a  palindrome  that	is  shorter  than  the	entire
       string. For example, although "abcba" is	correctly matched, if the sub-
       ject  is	 "ababa",  PCRE	 finds palindrome "aba"	at the start, and then
       fails at	top level, as the end of the  string  does  not	 follow.  Once
       again,  it  cannot  jump	 back into the recursion to try	other alterna-
       tives, so the entire match fails.

       The second way in which PCRE and	Perl differ in	their  recursion  pro-
       cessing	is in the handling of captured values. In Perl,	when a subpat-
       tern is called recursively or as	a subpattern (see the  next  section),
       it  has	no  access to any values that were captured outside the	recur-
       sion. In	PCRE these values can be referenced.  Consider	the  following
       pattern:

       ^(.)(\1|a(?2))

       In  PCRE,  it matches "bab". The	first capturing	parentheses match "b",
       then in the second group, when the back reference  \1  fails  to	 match
       "b",  the second	alternative matches "a", and then recurses. In the re-
       cursion,	\1 does	now match "b" and so  the  whole  match	 succeeds.  In
       Perl,  the  pattern fails to match because inside the recursive call \1
       cannot access the externally set	value.

SUBPATTERNS AS SUBROUTINES
       If the syntax for a recursive subpattern	call (either by	number	or  by
       name)  is  used outside the parentheses to which	it refers, it operates
       like a subroutine in a programming language. The	called subpattern  can
       be  defined  before or after the	reference. A numbered reference	can be
       absolute	or relative, as	in the following examples:

       (...(absolute)...)...(?2)...
       (...(relative)...)...(?-1)...
       (...(?+1)...(relative)...

       An earlier example pointed  out	that  the  following  pattern  matches
       "sense  and  sensibility"  and  "response  and responsibility", but not
       "sense and responsibility":

       (sens|respons)e and \1ibility

       If instead the following	pattern	is used, it matches "sense and respon-
       sibility" and the other two strings:

       (sens|respons)e and (?1)ibility

       Another example is provided in the discussion of	DEFINE earlier.

       All subroutine calls, recursive or not, are always  treated  as	atomic
       groups.	That  is,  once	 a  subroutine has matched some	of the subject
       string, it is never re-entered, even if it  contains  untried  alterna-
       tives  and there	is a subsequent	matching failure. Any capturing	paren-
       theses that are set during the subroutine call revert to	their previous
       values afterwards.

       Processing options such as case-independence are	fixed when  a  subpat-
       tern  is	defined, so if it is used as a subroutine, such	options	cannot
       be changed for different	calls.	For  example,  the  following  pattern
       matches	"abcabc"  but not "abcABC", as the change of processing	option
       does not	affect the called subpattern:

       (abc)(?i:(?-1))

ONIGURUMA SUBROUTINE SYNTAX
       For compatibility with Oniguruma, the non-Perl syntax \g	followed by  a
       name or a number	enclosed either	in angle brackets or single quotes, is
       alternative syntax for referencing a subpattern as a subroutine,	possi-
       bly recursively.	Here follows two of the	examples used above, rewritten
       using this syntax:

       (?<pn> \( ( (?>[^()]+) |	\g<pn> )* \) )
       (sens|respons)e and \g'1'ibility

       PCRE  supports  an extension to Oniguruma: if a number is preceded by a
       plus or minus sign, it is taken as a relative reference,	for example:

       (abc)(?i:\g<-1>)

       Notice that \g{...} (Perl syntax) and \g<...>  (Oniguruma  syntax)  are
       not synonymous. The former is a back reference; the latter is a subrou-
       tine call.

BACKTRACKING CONTROL
       Perl  5.10  introduced some "Special Backtracking Control Verbs", which
       are still described in the Perl documentation as	"experimental and sub-
       ject to change or removal in a future version of	Perl". It goes	on  to
       say:  "Their usage in production	code should be noted to	avoid problems
       during upgrades." The same remarks apply	to the PCRE features described
       in this section.

       The new verbs make use of what was previously invalid syntax: an	 open-
       ing parenthesis followed	by an asterisk.	They are generally of the form
       (*VERB)	or  (*VERB:NAME). Some can take	either form, possibly behaving
       differently depending on	whether	a name is present. A name is  any  se-
       quence  of  characters that does	not include a closing parenthesis. The
       maximum name length is 255 in the 8-bit library and 65535 in the	16-bit
       and 32-bit libraries. If	the name is empty, that	 is,  if  the  closing
       parenthesis  immediately	 follows  the  colon,  the effect is as	if the
       colon was not there. Any	number of these	verbs can occur	in a pattern.

       The behavior of these verbs in repeated groups, assertions, and in sub-
       patterns	called as subroutines (whether	or  not	 recursively)  is  de-
       scribed below.

       Optimizations That Affect Backtracking Verbs

       PCRE  contains some optimizations that are used to speed	up matching by
       running some checks at the start	of each	match attempt. For example, it
       can know	the minimum length of matching subject,	or that	 a  particular
       character must be present. When one of these optimizations bypasses the
       running	of a match, any	included backtracking verbs are	not processed.
       processed. You can suppress the start-of-match optimizations by setting
       option no_start_optimize	when calling compile/2 or run/3, or by	start-
       ing the pattern with (*NO_START_OPT).

       Experiments  with  Perl	suggest	that it	too has	similar	optimizations,
       sometimes leading to anomalous results.

       Verbs That Act Immediately

       The following verbs act as soon as they are encountered.	They must  not
       be followed by a	name.

       (*ACCEPT)

       This  verb causes the match to end successfully,	skipping the remainder
       of the pattern. However,	when it	is inside a subpattern that is	called
       as  a  subroutine, only that subpattern is ended	successfully. Matching
       then continues at the outer level. If (*ACCEPT) is triggered in a posi-
       tive assertion, the assertion succeeds; in a  negative  assertion,  the
       assertion fails.

       If  (*ACCEPT)  is inside	capturing parentheses, the data	so far is cap-
       tured. For example, the following matches "AB", "AAD", or  "ACD".  When
       it matches "AB",	"B" is captured	by the outer parentheses.

       A((?:A|B(*ACCEPT)|C)D)

       The  following  verb causes a matching failure, forcing backtracking to
       occur. It is equivalent to (?!) but easier to read.

       (*FAIL) or (*F)

       The Perl	documentation states that it is	probably useful	only when com-
       bined with (?{})	or (??{}).  Those  are	Perl  features	that  are  not
       present in PCRE.

       A  match	 with the string "aaaa"	always fails, but the callout is taken
       before each backtrack occurs (in	this example, 10 times).

       Recording Which Path Was	Taken

       The main	purpose	of this	verb is	to track how a match was  arrived  at,
       although	it also	has a secondary	use in with advancing the match	start-
       ing point (see (*SKIP) below).

   Note:
       In  Erlang,  there  is no interface to retrieve a mark with run/2,3, so
       only the	secondary purpose is relevant to the Erlang programmer.

       The rest	of this	section	is  therefore  deliberately  not  adapted  for
       reading	by  the	Erlang programmer, but the examples can	help in	under-
       standing	NAMES as they can be used by (*SKIP).

       (*MARK:NAME) or (*:NAME)

       A name is always	required with this verb. There	can  be	 as  many  in-
       stances	of  (*MARK)  as	 you like in a pattern,	and their names	do not
       have to be unique.

       When a match succeeds, the name of the last  encountered	 (*MARK:NAME),
       (*PRUNE:NAME),  or  (*THEN:NAME)	on the matching	path is	passed back to
       the caller as described in section "Extra data for pcre_exec()" in  the
       pcreapi documentation. In the following example of pcretest output, the
       /K modifier requests the	retrieval and outputting of (*MARK) data:

	 re> /X(*MARK:A)Y|X(*MARK:B)Z/K
       data> XY
	0: XY
       MK: A
       XZ
	0: XZ
       MK: B

       The (*MARK) name	is tagged with "MK:" in	this output, and in this exam-
       ple  it indicates which of the two alternatives matched.	This is	a more
       efficient way of	obtaining this information than	putting	each  alterna-
       tive in its own capturing parentheses.

       If  a  verb  with a name	is encountered in a positive assertion that is
       true, the name is recorded and passed back if it	is  the	 last  encoun-
       tered.  This does not occur for negative	assertions or failing positive
       assertions.

       After a partial match or	a failed match,	the last encountered  name  in
       the entire match	process	is returned, for example:

	 re> /X(*MARK:A)Y|X(*MARK:B)Z/K
       data> XP
       No match, mark =	B

       Notice  that  in	this unanchored	example, the mark is retained from the
       match attempt that started at letter "X"	 in  the  subject.  Subsequent
       match attempts starting at "P" and then with an empty string do not get
       as far as the (*MARK) item, nevertheless	do not reset it.

       Verbs That Act after Backtracking

       The following verbs do nothing when they	are encountered. Matching con-
       tinues  with what follows, but if there is no subsequent	match, causing
       a backtrack to the verb,	a failure is  forced.  That  is,  backtracking
       cannot  pass  to	the left of the	verb. However, when one	of these verbs
       appears inside an atomic	group or an assertion that is true, its	effect
       is confined to that group, as once the group has	been matched, there is
       never any backtracking into it. In  this	 situation,  backtracking  can
       "jump  back"  to	the left of the	entire atomic group or assertion. (Re-
       member also, as stated above, that this localization  also  applies  in
       subroutine calls.)

       These  verbs  differ  in	exactly	what kind of failure occurs when back-
       tracking	reaches	them. The behavior described below is what occurs when
       the verb	is not in a subroutine or an  assertion.  Subsequent  sections
       cover these special cases.

       The  following  verb,  which must not be	followed by a name, causes the
       whole match to fail outright if there is	a later	matching failure  that
       causes  backtracking to reach it. Even if the pattern is	unanchored, no
       further attempts	to find	a match	by advancing the starting  point  take
       place.

       (*COMMIT)

       If (*COMMIT) is the only	backtracking verb that is encountered, once it
       has  been  passed,  run/2,3 is committed	to find	a match	at the current
       starting	point, or not at all, for example:

       a+(*COMMIT)b

       This matches "xxaab" but	not "aacaab". It can be	thought	of as  a  kind
       of dynamic anchor, or "I've started, so I must finish". The name	of the
       most  recently passed (*MARK) in	the path is passed back	when (*COMMIT)
       forces a	match failure.

       If more than one	backtracking verb exists in a pattern, a different one
       that follows (*COMMIT) can be triggered first, so merely	passing	(*COM-
       MIT) during a match does	not always guarantee that a match must	be  at
       this starting point.

       Notice  that  (*COMMIT) at the start of a pattern is not	the same as an
       anchor, unless the PCRE start-of-match optimizations are	turned off, as
       shown in	the following example:

       1> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list}]).
       {match,["abc"]}
       2> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list},no_start_optimize]).
       nomatch

       For this	pattern, PCRE knows that any match must	start with "a",	so the
       optimization skips along	the subject to "a" before applying the pattern
       to the first set	of data. The match attempt then	succeeds. In the  sec-
       ond  call  the  no_start_optimize  disables the optimization that skips
       along to	the first character. The pattern is now	 applied  starting  at
       "x",  and  so the (*COMMIT) causes the match to fail without trying any
       other starting points.

       The following verb causes the match to fail at the current starting po-
       sition in the subject if	there is a later matching failure that	causes
       backtracking to reach it:

       (*PRUNE)	or (*PRUNE:NAME)

       If  the	pattern	 is  unanchored, the normal "bumpalong"	advance	to the
       next starting character then occurs. Backtracking can occur as usual to
       the left	of (*PRUNE), before it is reached, or  when  matching  to  the
       right  of (*PRUNE), but if there	is no match to the right, backtracking
       cannot cross (*PRUNE). In simple	cases, the use of (*PRUNE) is just  an
       alternative  to an atomic group or possessive quantifier, but there are
       some uses of (*PRUNE) that cannot be expressed in any other way.	In  an
       anchored	pattern, (*PRUNE) has the same effect as (*COMMIT).

       The    behavior	 of   (*PRUNE:NAME)   is   the	 not   the   same   as
       (*MARK:NAME)(*PRUNE). It	is like	(*MARK:NAME) in	that the name  is  re-
       membered	for passing back to the	caller.	However, (*SKIP:NAME) searches
       only for	names set with (*MARK).

   Note:
       The fact	that (*PRUNE:NAME) remembers the name is useless to the	Erlang
       programmer, as names cannot be retrieved.

       The  following  verb,  when specified without a name, is	like (*PRUNE),
       except that if the pattern is unanchored, the  "bumpalong"  advance  is
       not  to	the  next  character, but to the position in the subject where
       (*SKIP) was encountered.

       (*SKIP)

       (*SKIP) signifies that whatever text was	matched	leading	up to it  can-
       not be part of a	successful match. Consider:

       a+(*SKIP)b

       If  the	subject	 is  "aaaac...",  after	 the first match attempt fails
       (starting at the	first character	in the	string),  the  starting	 point
       skips  on  to  start  the next attempt at "c". Notice that a possessive
       quantifier does not have	the same effect	as this	example;  although  it
       would  suppress backtracking during the first match attempt, the	second
       attempt would start at the second character instead of skipping	on  to
       "c".

       When (*SKIP) has	an associated name, its	behavior is modified:

       (*SKIP:NAME)

       When  this  is  triggered,  the	previous  path	through	the pattern is
       searched	for the	most recent (*MARK) that has the same name. If one  is
       found,  the  "bumpalong"	advance	is to the subject position that	corre-
       sponds to that (*MARK) instead of to where (*SKIP) was encountered.  If
       no (*MARK) with a matching name is found, (*SKIP) is ignored.

       Notice  that  (*SKIP:NAME) searches only	for names set by (*MARK:NAME).
       It ignores names	that are set by	(*PRUNE:NAME) or (*THEN:NAME).

       The following verb causes a skip	to the next innermost alternative when
       backtracking reaches it.	That is, it cancels any	 further  backtracking
       within the current alternative.

       (*THEN) or (*THEN:NAME)

       The verb	name comes from	the observation	that it	can be used for	a pat-
       tern-based if-then-else block:

       ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...

       If  the COND1 pattern matches, FOO is tried (and	possibly further items
       after the end of	the group if FOO succeeds). On	failure,  the  matcher
       skips  to  the second alternative and tries COND2, without backtracking
       into COND1. If that succeeds and	BAR fails, COND3 is tried. If BAZ then
       fails, there are	no more	alternatives, so there is a backtrack to what-
       ever came before	the entire group. If (*THEN) is	not inside an alterna-
       tion, it	acts like (*PRUNE).

       The   behavior	of   (*THEN:NAME)   is	 the   not   the    same    as
       (*MARK:NAME)(*THEN). It is like (*MARK:NAME) in that the	name is	remem-
       bered  for  passing  back to the	caller.	However, (*SKIP:NAME) searches
       only for	names set with (*MARK).

   Note:
       The fact	that (*THEN:NAME) remembers the	name is	useless	to the	Erlang
       programmer, as names cannot be retrieved.

       A  subpattern that does not contain a | character is just a part	of the
       enclosing alternative; it is not	a nested alternation with only one al-
       ternative. The effect of	(*THEN)	extends	beyond such  a	subpattern  to
       the  enclosing alternative. Consider the	following pattern, where A, B,
       and so on, are complex pattern fragments	that  do  not  contain	any  |
       characters at this level:

       A (B(*THEN)C) | D

       If  A and B are matched,	but there is a failure in C, matching does not
       backtrack into A; instead it moves to the next alternative, that	is, D.
       However,	if the subpattern containing (*THEN) is	given an  alternative,
       it behaves differently:

       A (B(*THEN)C | (*FAIL)) | D

       The  effect of (*THEN) is now confined to the inner subpattern. After a
       failure in C, matching moves to (*FAIL),	which causes the whole subpat-
       tern to fail, as	there are no more alternatives to try. In  this	 case,
       matching	does now backtrack into	A.

       Notice  that  a	conditional subpattern is not considered as having two
       alternatives, as	only one is ever used. That is,	the | character	 in  a
       conditional  subpattern	has  a different meaning. Ignoring whitespace,
       consider:

       ^.*? (?(?=a) a |	b(*THEN)c )

       If the subject is "ba", this pattern does not  match.  As  .*?  is  un-
       greedy,	it initially matches zero characters. The condition (?=a) then
       fails, the character "b"	is matched, but	"c" is	not.  At  this	point,
       matching	 does not backtrack to .*? as can perhaps be expected from the
       presence	of the | character. The	conditional subpattern is part of  the
       single  alternative  that comprises the whole pattern, and so the match
       fails. (If there	was a backtrack	into .*?, allowing it  to  match  "b",
       the match would succeed.)

       The verbs described above provide four different	"strengths" of control
       when subsequent matching	fails:

	 * (*THEN)  is the weakest, carrying on	the match at the next alterna-
	   tive.

	 * (*PRUNE) comes next,	fails the match	at the current starting	 posi-
	   tion,  but  allows  an  advance to the next character (for an unan-
	   chored pattern).

	 * (*SKIP) is similar, except that the advance can be  more  than  one
	   character.

	 * (*COMMIT) is	the strongest, causing the entire match	to fail.

       More than One Backtracking Verb

       If  more	 than  one  backtracking verb is present in a pattern, the one
       that is backtracked onto	first acts. For	example, consider the  follow-
       ing pattern, where A, B,	and so on, are complex pattern fragments:

       (A(*COMMIT)B(*THEN)C|ABD)

       If  A matches but B fails, the backtrack	to (*COMMIT) causes the	entire
       match to	fail. However, if A and	B match, but C fails, the backtrack to
       (*THEN) causes the next alternative (ABD) to be tried. This behavior is
       consistent, but is not always the same as in Perl. It means that	if two
       or more backtracking verbs appear in succession,	the last of  them  has
       no effect. Consider the following example:

       If there	is a matching failure to the right, backtracking onto (*PRUNE)
       causes  it to be	triggered, and its action is taken. There can never be
       a backtrack onto	(*COMMIT).

       Backtracking Verbs in Repeated Groups

       PCRE differs from Perl in its handling of  backtracking	verbs  in  re-
       peated groups. For example, consider:

       /(a(*COMMIT)b)+ac/

       If  the	subject	 is  "abac",  Perl matches, but	PCRE fails because the
       (*COMMIT) in the	second repeat of the group acts.

       Backtracking Verbs in Assertions

       (*FAIL) in an assertion has its normal effect: it forces	 an  immediate
       backtrack.

       (*ACCEPT) in a positive assertion causes	the assertion to succeed with-
       out  any	 further processing. In	a negative assertion, (*ACCEPT)	causes
       the assertion to	fail without any further processing.

       The other backtracking verbs are	not treated specially if  they	appear
       in  a  positive assertion. In particular, (*THEN) skips to the next al-
       ternative in the	innermost enclosing group that has  alternations,  re-
       gardless	if this	is within the assertion.

       Negative	 assertions are, however, different, to	ensure that changing a
       positive	assertion into a negative assertion changes its	result.	 Back-
       tracking	 into (*COMMIT), (*SKIP), or (*PRUNE) causes a negative	asser-
       tion to be true,	without	considering any	further	 alternative  branches
       in  the	assertion.  Backtracking into (*THEN) causes it	to skip	to the
       next enclosing alternative within the assertion (the normal  behavior),
       but if the assertion does not have such an alternative, (*THEN) behaves
       like (*PRUNE).

       Backtracking Verbs in Subroutines

       These  behaviors	 occur	regardless  if the subpattern is called	recur-
       sively. The treatment of	subroutines  in	 Perl  is  different  in  some
       cases.

	 * (*FAIL)  in	a subpattern called as a subroutine has	its normal ef-
	   fect: it forces an immediate	backtrack.

	 * (*ACCEPT) in	a subpattern called as a subroutine causes the subrou-
	   tine	match to succeed without any further processing. Matching then
	   continues after the subroutine call.

	 * (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as  a  sub-
	   routine cause the subroutine	match to fail.

	 * (*THEN)  skips  to  the next	alternative in the innermost enclosing
	   group within	the subpattern that has	alternatives. If there	is  no
	   such	 group	within	the  subpattern, (*THEN) causes	the subroutine
	   match to fail.

Ericsson AB			  stdlib 5.2				 re(3)
Want to link to this manual page? Use this URL:
<https://man.freebsd.org/cgi/man.cgi?query=re&manpath=FreeBSD+15.1-RELEASE+and+Ports.quarterly>
home | help
Header And Logo

Peripheral Links

Site Navigation

FreeBSD Manual Pages

Header And Logo

Peripheral Links

Search

Site Navigation

FreeBSD Manual Pages