FreeBSD Manual Pages

home | help
PEG(1)			    General Commands Manual			PEG(1)

NAME
       peg, leg	- parser generators

SYNOPSIS
       peg [-hvV -ooutput] [filename ...]
       leg [-hvV -ooutput] [filename ...]

DESCRIPTION
       peg  and	 leg  are tools	for generating recursive-descent parsers: pro-
       grams that perform pattern matching on text.  They  process  a  Parsing
       Expression  Grammar  (PEG) [Ford	2004] to produce a program that	recog-
       nises legal sentences of	that grammar.  peg processes PEGs written  us-
       ing  the	 original syntax described by Ford; leg	processes PEGs written
       using slightly different	syntax and conventions that  are  intended  to
       make  it	 an  attractive	 replacement for parsers built with lex(1) and
       yacc(1).	 Unlike	lex and	yacc, peg and leg support unlimited backtrack-
       ing, provide ordered choice as a	means for disambiguation, and can com-
       bine scanning (lexical analysis)	and parsing (syntactic analysis)  into
       a single	activity.

       peg  reads  the	specified filenames, or	standard input if no filenames
       are given, for a	grammar	describing the parser to generate.   peg  then
       generates  a  C	source file that defines a function yyparse().	This C
       source file can be included in, or compiled and	then  linked  with,  a
       client  program.	  Each	time  the  client  program calls yyparse() the
       parser consumes input text according to	the  parsing  rules,  starting
       from  the first rule in the grammar.  yyparse() returns non-zero	if the
       input could be parsed according to the grammar; it returns zero if  the
       input could not be parsed.

       The  prefix 'yy'	or 'YY'	is prepended to	all externally-visible symbols
       in the generated	parser.	 This is intended to reduce the	risk of	 name-
       space pollution in client programs.  (The choice	of 'yy'	is historical;
       see lex(1) and yacc(1), for example.)

OPTIONS
       peg and leg provide the following options:

       -h     prints a summary of available options and	then exits.

       -ooutput
	      writes  the  generated  parser to	the file output	instead	of the
	      standard output.

       -P     suppresses #line directives in the output.

       -v     writes verbose information to standard error while working.

       -V     writes version information to standard error then	exits.

A SIMPLE EXAMPLE
       The following peg input specifies a grammar with	a single rule  (called
       'start')	 that  is  satisfied when the input contains the string	"user-
       name".

	   start <- "username"

       (The quotation marks are	not part of the	matched	text;  they  serve  to
       indicate	a literal string to be matched.)  In other words, yyparse() in
       the  generated  C  source  will	return non-zero	only if	the next eight
       characters read from the	input spell the	word "username".  If the input
       contains	anything else, yyparse() returns zero and no input  will  have
       been  consumed.	 (Subsequent calls to yyparse()	will also return zero,
       since the parser	is effectively blocked looking for the	string	"user-
       name".)	 To  ensure  progress  we can add an alternative clause	to the
       'start' rule that will match any	single character if "username" is  not
       found.

	   start <- "username"
		  / .

       yyparse()  now  always  returns non-zero	(except	at the very end	of the
       input).	To do something	useful we can add actions to the rules.	 These
       actions are performed after a complete match is	found  (starting  from
       the  first  rule)  and are chosen according to the 'path' taken through
       the grammar to match the	input.	(Linguists  would  call	 this  path  a
       'phrase marker'.)

	   start <- "username"	  { printf("%s\n", getlogin());	}
		  / < .	>	  { putchar(yytext[0]);	}

       The  first  line	 instructs  the	 parser	to print the user's login name
       whenever	it sees	"username" in the input.  If  that  match  fails,  the
       second  line  tells  the	parser to echo the next	character on the input
       the standard output.  Our parser	is now performing useful work: it will
       copy the	input to the output, replacing all occurrences	of  "username"
       with the	user's account name.

       Note the	angle brackets ('<' and	'>') that were added to	the second al-
       ternative.   These have no effect on the	meaning	of the rule, but serve
       to delimit the text made	available to the following action in the vari-
       able yytext.

       If the above grammar is placed in the file  username.peg,  running  the
       command

	   peg -o username.c username.peg

       will save the corresponding parser in the file username.c.  To create a
       complete	 program  this parser could be included	by a C program as fol-
       lows.

	   #include <stdio.h>	   /* printf(),	putchar() */
	   #include <unistd.h>	   /* getlogin() */

	   #include "username.c"   /* yyparse()	*/

	   int main()
	   {
	     while (yyparse())	   /* repeat until EOF */
	       ;
	     return 0;
	   }

PEG GRAMMARS
       A grammar consists of a set of named rules.

	   name	<- pattern

       The pattern contains one	or more	of the following elements.

       name   The element stands for the entire	pattern	in the rule  with  the
	      given name.

       "characters"
	      A	 character or string enclosed in double	quotes is matched lit-
	      erally.  The ANSI	C escape sequences are recognised  within  the
	      characters.

       'characters'
	      A	 character or string enclosed in single	quotes is matched lit-
	      erally, as above.

       [characters]
	      A	set of characters enclosed in square brackets matches any sin-
	      gle character from the set, with escape characters recognised as
	      above.  If the set begins	with an	uparrow	(^) then  the  set  is
	      negated (the element matches any character not in	the set).  Any
	      pair  of	characters  separated  with  a dash (-)	represents the
	      range of characters from the first to the	second,	inclusive.   A
	      single alphabetic	character or underscore	is matched by the fol-
	      lowing set.

		  [a-zA-Z_]

	      Similarly,  the  following matches  any single non-digit charac-
	      ter.

		  [^0-9]

       .      A	dot matches any	character.  Note that the only time this fails
	      is at the	end of file, where there is no character to match.

       ( pattern )
	      Parentheses are used for grouping	(modifying the	precedence  of
	      the operators described below).

       { action	}
	      Curly braces surround actions.  The action is arbitrary C	source
	      code  to	be executed at the end of matching.  Any braces	within
	      the action must be properly nested.  Any	input  text  that  was
	      matched  before  the action and delimited	by angle brackets (see
	      below) is	made available within the action as  the  contents  of
	      the character array yytext.  The length of (number of characters
	      in) yytext is available in the variable yyleng.  (These variable
	      names are	historical; see	lex(1).)

       <      An opening angle bracket always matches (consuming no input) and
	      causes the parser	to begin accumulating matched text.  This text
	      will be made available to	actions	in the variable	yytext.

       >      A	 closing angle bracket always matches (consuming no input) and
	      causes the parser	to stop	accumulating text for yytext.

       The above elements can be made optional and/or repeatable with the fol-
       lowing suffixes:

       element ?
	      The element is optional.	If present on the input,  it  is  con-
	      sumed  and  the match succeeds.  If not present on the input, no
	      text is consumed and the match succeeds anyway.

       element +
	      The element is repeatable.  If present on	the input, one or more
	      occurrences of element are consumed and the match	succeeds.   If
	      no  occurrences  of  element are present on the input, the match
	      fails.

       element *
	      The element is optional and repeatable.  If present on  the  in-
	      put,  one	 or  more  occurrences of element are consumed and the
	      match succeeds.  If no occurrences of element are	present	on the
	      input, the match succeeds	anyway.

       The above elements and suffixes can be converted	into predicates	 (that
       match  arbitrary	 input	text  and subsequently succeed or fail without
       consuming that input) with the following	prefixes:

       & element
	      The predicate succeeds only if element can  be  matched.	 Input
	      text scanned while matching element is not consumed from the in-
	      put and remains available	for subsequent matching.

       ! element
	      The predicate succeeds only if element cannot be matched.	 Input
	      text scanned while matching element is not consumed from the in-
	      put  and	remains	 available for subsequent matching.  A popular
	      idiom is

		  !.

	      which matches the	end of file, after the last character  of  the
	      input has	already	been consumed.

       A special form of the '&' predicate is provided:

       &{ expression }
	      In  this	predicate  the	simple C expression (not statement) is
	      evaluated	immediately when the parser reaches the	predicate.  If
	      the expression yields non-zero (true) the	'match'	 succeeds  and
	      the  parser  continues with the next element in the pattern.  If
	      the expression yields zero (false) the  'match'  fails  and  the
	      parser backs up to look for an alternative parse of the input.

       Several	elements  (with	 or without prefixes and suffixes) can be com-
       bined into a sequence by	writing	them one after the other.  The	entire
       sequence	 matches  only	if  each individual element within it matches,
       from left to right.

       Sequences can be	separated into disjoint	alternatives by	 the  alterna-
       tion operator '/'.

       sequence-1 / sequence-2 / ... / sequence-N
	      Each  sequence  is  tried	 in turn until one of them matches, at
	      which time matching for the overall pattern succeeds.   If  none
	      of  the  sequences matches then the match	of the overall pattern
	      fails.

       Finally,	the pound sign (#) introduces a	comment	(discarded) that  con-
       tinues until the	end of the line.

       To  summarise  the  above,  the	parser	tries  to match	the input text
       against	a  pattern  containing	literals,  names  (representing	 other
       rules),	and various operators (written as prefixes, suffixes, juxtapo-
       sition for sequencing and and infix alternation operator)  that	modify
       how the elements	within the pattern are matched.	 Matches are made from
       left  to	 right,	 'descending' into named sub-rules as they are encoun-
       tered.  If  the	matching  process  fails,  the	parser	'back  tracks'
       ('rewinding'  the input appropriately in	the process) to	find the near-
       est alternative 'path' through the grammar.  In other words the	parser
       performs	 a  depth-first,  left-to-right	 search	for the	first success-
       fully-matching path through the rules.  If found, the actions along the
       successful path are executed (in	the order they were encountered).

       Note that predicates are	evaluated immediately during the search	for  a
       successful  match,  since  they contribute to the success or failure of
       the search.  Actions, however, are evaluated only  after	 a  successful
       match has been found.

PEG GRAMMAR FOR	PEG GRAMMARS
       The grammar for peg grammars is shown below.  This will both illustrate
       and formalise the above description.

	   Grammar	   <- Spacing Definition+ EndOfFile

	   Definition	   <- Identifier LEFTARROW Expression
	   Expression	   <- Sequence ( SLASH Sequence	)*
	   Sequence	   <- Prefix*
	   Prefix	   <- AND Action
			    / (	AND | NOT )? Suffix
	   Suffix	   <- Primary (	QUERY /	STAR / PLUS )?
	   Primary	   <- Identifier !LEFTARROW
			    / OPEN Expression CLOSE
			    / Literal
			    / Class
			    / DOT
			    / Action
			    / BEGIN
			    / END

	   Identifier	   <- <	IdentStart IdentCont* >	Spacing
	   IdentStart	   <- [a-zA-Z_]
	   IdentCont	   <- IdentStart / [0-9]
	   Literal	   <- ['] < ( !['] Char	 )* > ['] Spacing
			    / ["] < ( !["] Char	 )* > ["] Spacing
	   Class	   <- '[' < ( !']' Range )* > ']' Spacing
	   Range	   <- Char '-' Char / Char
	   Char		   <- '\\' [abefnrtv'"\[\]\\]
			    / '\\' [0-3][0-7][0-7]
			    / '\\' [0-7][0-7]?
			    / '\\' '-'
			    / !'\\' .
	   LEFTARROW	   <- '<-' Spacing
	   SLASH	   <- '/' Spacing
	   AND		   <- '&' Spacing
	   NOT		   <- '!' Spacing
	   QUERY	   <- '?' Spacing
	   STAR		   <- '*' Spacing
	   PLUS		   <- '+' Spacing
	   OPEN		   <- '(' Spacing
	   CLOSE	   <- ')' Spacing
	   DOT		   <- '.' Spacing
	   Spacing	   <- (	Space /	Comment	)*
	   Comment	   <- '#' ( !EndOfLine . )* EndOfLine
	   Space	   <- '	' / '\t' / EndOfLine
	   EndOfLine	   <- '\r\n' / '\n' / '\r'
	   EndOfFile	   <- !.
	   Action	   <- '{' < [^}]* > '}'	Spacing
	   BEGIN	   <- '<' Spacing
	   END		   <- '>' Spacing

LEG GRAMMARS
       leg  is a variant of peg	that adds some features	of lex(1) and yacc(1).
       It differs from peg in the following ways.

       %{ text... %}
	      A	declaration section can	appear anywhere	that a rule definition
	      is expected.  The	text between the delimiters '%{' and  '%}'  is
	      copied  verbatim	to the generated C parser code before the code
	      that implements the parser itself.

       name = pattern
	      The 'assignment' operator	replaces the left arrow	operator '<-'.

       rule-name
	      Hyphens can appear as letters in the names of rules.   Each  hy-
	      phen  is	converted into an underscore in	the generated C	source
	      code.  A single hyphen '-' is a legal rule name.

		  -	  = [ \t\n\r]*
		  number  = [0-9]+		   -
		  name	  = [a-zA-Z_][a-zA_Z_0-9]* -
		  l-paren = '('			   -
		  r-paren = ')'			   -

	      This example shows how ignored whitespace	can  be	 obvious  when
	      reading the grammar and yet unobtrusive when placed liberally at
	      the end of every rule associated with a lexical element.

       seq-1 | seq-2
	      The alternation operator is vertical bar '|' rather than forward
	      slash '/'.  The peg rule

		  name <- sequence-1
			/ sequence-2
			/ sequence-3

	      is therefore written

		  name = sequence-1
		       | sequence-2
		       | sequence-3
		       ;

	      in  leg  (with  the final	semicolon being	optional, as described
	      next).

       @{ action }
	      Actions prefixed with an 'at' symbol will	 be  performed	during
	      parsing, at the time they	are encountered	while matching the in-
	      put text with a rule.  Because of	back-tracking in the PEG pars-
	      ing algorithm, actions prefixed with '@' might be	performed mul-
	      tiple times for the same input text.  (The usual behviour	of ac-
	      tions  is	that they are saved up until matching is complete, and
	      then those that are part of the final derivation	are  performed
	      in  left-to-right	 order.)   The	variable  yytext  is available
	      within these actions.

       exp ~ { action }
	      A	postfix	operator ~{ action } can be placed after  any  expres-
	      sion and will behave like	a normal action	(arbitrary C code) ex-
	      cept  that  it  is  invoked  only	when exp fails.	 It binds less
	      tightly than any other operator except alternation and  sequenc-
	      ing,  and	 is  intended to make error handling and recovery code
	      easier to	write.	Note that yytext and yyleng are	not  available
	      inside  these  actions, but the pointer variable yy is available
	      to give the code access  to  any	user-defined  members  of  the
	      parser  state  (see  "CUSTOMISING	THE PARSER" below).  Note also
	      that exp is always a single expression; to invoke	an  error  ac-
	      tion for any failure within a sequence, parentheses must be used
	      to group the sequence into a single expression.

		  rule = e1 e2 e3 ~{ error("e[12] ok; e3 has failed"); }
		       | ...

		  rule = (e1 e2	e3) ~{ error("one of e[123] has	failed"); }
		       | ...

       pattern ;
	      A	semicolon punctuator can optionally terminate a	pattern.

       %% text...
	      A	 double	 percent  '%%' terminates the rules (and declarations)
	      section of the grammar.  All text	following '%%' is copied  ver-
	      batim  to	the generated C	parser code after the parser implemen-
	      tation code.

       $$ = value
	      A	sub-rule can return a semantic value from an action by assign-
	      ing it to	the pseudo-variable '$$'.  All	semantic  values  must
	      have  the	same type (which defaults to 'int').  This type	can be
	      changed by defining YYSTYPE in a declaration section.

       identifier:name
	      The semantic value returned (by  assigning  to  '$$')  from  the
	      sub-rule	name  is associated with the identifier	and can	be re-
	      ferred to	in subsequent actions.

       The desk	calculator example below illustrates the use of	'$$' and ':'.

LEG EXAMPLE: A DESK CALCULATOR
       The extensions in leg described above allow useful parsers and  evalua-
       tors (including declarations, grammar rules, and	supporting C functions
       such  as	'main')	to be kept within a single source file.	 To illustrate
       this we show a simple desk calculator supporting	the four common	arith-
       metic operators and  named  variables.	The  intermediate  results  of
       arithmetic  evaluation  will be accumulated on an implicit stack	by re-
       turning them as semantic	values from sub-rules.

	   %{
	   #include <stdio.h>	  /* printf() */
	   #include <stdlib.h>	  /* atoi() */
	   int vars[26];
	   %}

	   Stmt	   = - e:Expr EOL		   { printf("%d\n", e);	}
		   | ( !EOL . )* EOL		   { printf("error\n");	}

	   Expr	   = i:ID ASSIGN s:Sum		   { $$	= vars[i] = s; }
		   | s:Sum			   { $$	= s; }

	   Sum	   = l:Product
			   ( PLUS  r:Product	   { l += r; }
			   | MINUS r:Product	   { l -= r; }
			   )*			   { $$	= l; }

	   Product = l:Value
			   ( TIMES  r:Value	   { l *= r; }
			   | DIVIDE r:Value	   { l /= r; }
			   )*			   { $$	= l; }

	   Value   = i:NUMBER			   { $$	= atoi(yytext);	}
		   | i:ID !ASSIGN		   { $$	= vars[i]; }
		   | OPEN i:Expr CLOSE		   { $$	= i; }

	   NUMBER  = < [0-9]+ >	   -		   { $$	= atoi(yytext);	}
	   ID	   = < [a-z]  >	   -		   { $$	= yytext[0] - 'a'; }
	   ASSIGN  = '='	   -
	   PLUS	   = '+'	   -
	   MINUS   = '-'	   -
	   TIMES   = '*'	   -
	   DIVIDE  = '/'	   -
	   OPEN	   = '('	   -
	   CLOSE   = ')'	   -

	   -	   = [ \t]*
	   EOL	   = '\n' | '\r\n' | '\r' | ';'

	   %%

	   int main()
	   {
	     while (yyparse())
	       ;
	     return 0;
	   }

LEG GRAMMAR FOR	LEG GRAMMARS
       The grammar for leg grammars is shown below.  This will both illustrate
       and formalise the above description.

	   grammar =	   -
			   ( declaration | definition )+
			   trailer? end-of-file

	   declaration =   '%{'	< ( !'%}' . )* > RPERCENT

	   trailer =	   '%%'	< .* >

	   definition =	   identifier EQUAL expression SEMICOLON?

	   expression =	   sequence ( BAR sequence )*

	   sequence =	   error+

	   error =	   prefix ( TILDE action )?

	   prefix =	   AND action
	   |		   ( AND | NOT )? suffix

	   suffix =	   primary ( QUERY | STAR | PLUS )?

	   primary =	   identifier COLON identifier !EQUAL
	   |		   identifier !EQUAL
	   |		   OPEN	expression CLOSE
	   |		   literal
	   |		   class
	   |		   DOT
	   |		   action
	   |		   BEGIN
	   |		   END

	   identifier =	   < [-a-zA-Z_][-a-zA-Z_0-9]* >	-

	   literal =	   ['] < ( ![']	char )*	> ['] -
	   |		   ["] < ( !["]	char )*	> ["] -

	   class =	   '[' < ( !']'	range )* > ']' -

	   range =	   char	'-' char | char

	   char	=	   '\\'	[abefnrtv'"\[\]\\]
	   |		   '\\'	[0-3][0-7][0-7]
	   |		   '\\'	[0-7][0-7]?
	   |		   !'\\' .

	   action =	   '{' < braces* > '}' -

	   braces =	   '{' braces* '}'
	   |		   !'}'	.

	   EQUAL =	   '=' -
	   COLON =	   ':' -
	   SEMICOLON =	   ';' -
	   BAR =	   '|' -
	   AND =	   '&' -
	   NOT =	   '!' -
	   QUERY =	   '?' -
	   STAR	=	   '*' -
	   PLUS	=	   '+' -
	   OPEN	=	   '(' -
	   CLOSE =	   ')' -
	   DOT =	   '.' -
	   BEGIN =	   '<' -
	   END =	   '>' -
	   TILDE =	   '~' -
	   RPERCENT =	   '%}'	-

	   - =		   ( space | comment )*
	   space =	   ' ' | '\t' |	end-of-line
	   comment =	   '#' ( !end-of-line .	)* end-of-line
	   end-of-line =   '\r\n' | '\n' | '\r'
	   end-of-file =   !.

CUSTOMISING THE	PARSER
       The following symbols can be redefined in declaration sections to  mod-
       ify the generated parser	code.

       YYSTYPE
	      The semantic value type.	The pseudo-variable '$$' and the iden-
	      tifiers  'bound'	to  rule  results  with	the colon operator ':'
	      should all be considered as being	declared to  have  this	 type.
	      The default value	is 'int'.

       YYPARSE
	      The  name	 of  the  main entry point to the parser.  The default
	      value is 'yyparse'.

       YYPARSEFROM
	      The name of an alternative entry	point  to  the	parser.	  This
	      function expects one argument: the function corresponding	to the
	      rule  from  which	 the search for	a match	should begin.  The de-
	      fault is 'yyparsefrom'.  Note that yyparse() is defined as

		  int yyparse()	{ return yyparsefrom(yy_foo); }

	      where 'foo' is the name of the first rule	in the grammar.

       YY_INPUT(buf, result, max_size)
	      This macro is invoked by the parser to obtain more  input	 text.
	      buf  points  to an area of memory	that can hold at most max_size
	      characters.  The macro should copy input text to	buf  and  then
	      assign  the  integer  variable  result to	indicate the number of
	      characters copied.  If no	more input  is	available,  the	 macro
	      should  assign  0	 to result.  By	default, the YY_INPUT macro is
	      defined as follows.

		  #define YY_INPUT(buf,	result,	max_size)	 \
		  {						 \
		    int	yyc= getchar();				 \
		    result= (EOF == yyc) ? 0 : (*(buf)=	yyc, 1); \
		  }

	      Note that	if YY_CTX_LOCAL	is defined (see	below) then  an	 addi-
	      tional  first argument, containing the parser context, is	passed
	      to YY_INPUT.

       YY_DEBUG
	      If this symbols is defined then additional code will be included
	      in the parser that prints	vast quantities	of arcane  information
	      to the standard error while the parser is	running.

       YY_BEGIN
	      This  macro is invoked to	mark the start of input	text that will
	      be made available	in actions as 'yytext'.	 This  corresponds  to
	      occurrences  of  '<'  in	the grammar.  These are	converted into
	      predicates that are expected to succeed.	The default definition

		  #define YY_BEGIN (yybegin= yypos, 1)

	      therefore	 saves	the  current  input  position  and  returns  1
	      ('true') as the result of	the predicate.

       YY_END This  macros  corresponds	to '>' in the grammar.	Again, it is a
	      predicate	so the default definition saves	the input position be-
	      fore 'succeeding'.

		  #define YY_END (yyend= yypos,	1)

       YY_PARSE(T)
	      This macro declares the parser entry  points  (yyparse  and  yy-
	      parsefrom) to be of type T.  The default definition

		  #define YY_PARSE(T) T

	      leaves  yyparse()	 and yyparsefrom() with	global visibility.  If
	      they should not be externally visible  in	 other	source	files,
	      this macro can be	redefined to declare them 'static'.

		  #define YY_PARSE(T) static T

       YY_CTX_LOCAL
	      If  this	symbol	is  defined  during compilation	of a generated
	      parser then global parser	state will be kept in a	 structure  of
	      type  'yycontext'	 which	can  be	 declared as a local variable.
	      This allows multiple instances of	parsers	to coexist and	to  be
	      thread-safe.  The	parsing	function yyparse() will	be declared to
	      expect  a	 first	argument of type 'yycontext *',	an instance of
	      the structure holding the	global state for the parser.  This in-
	      stance must be allocated and initialised to zero by the  client.
	      A	trivial	but complete example is	as follows.

		  #include <stdio.h>

		  #define YY_CTX_LOCAL

		  #include "the-generated-parser.peg.c"

		  int main()
		  {
		    yycontext ctx;
		    memset(&ctx, 0, sizeof(yycontext));
		    while (yyparse(&ctx));
		    return 0;
		  }

	      Note  that  if this symbol is undefined then the compiled	parser
	      will statically allocate its global state	and  will  be  neither
	      reentrant	 nor thread-safe.  Note	also that the parser yycontext
	      structure	is initialised automatically the first time  yyparse()
	      is called; this structure	must therefore be properly initialised
	      to zero before the first call to yyparse().

       YY_CTX_MEMBERS
	      If   YY_CTX_LOCAL	  is   defined	(see  above)  then  the	 macro
	      YY_CTX_MEMBERS can be defined to expand to any additional	member
	      field declarations that the client would like  included  in  the
	      declaration of the 'yycontext' structure type.  These additional
	      members  are otherwise ignored by	the generated parser.  The in-
	      stance  of  'yycontext'  associated  with	 the  currently-active
	      parser is	available within actions as the	pointer	variable yy.

       YY_BUFFER_SIZE
	      The  initial  size of the	text buffer, in	bytes.	The default is
	      1024 and the buffer size is doubled whenever  required  to  meet
	      demand  during  parsing.	 An  application that typically	parses
	      much longer strings could	increase  this	to  avoid  unnecessary
	      buffer reallocation.

       YY_STACK_SIZE
	      The initial size of the variable and action stacks.  The default
	      is 128, which is doubled whenever	required to meet demand	during
	      parsing.	 Applications that have	deep call stacks with many lo-
	      cal variables, or	that perform many actions after	a single  suc-
	      cessful  match,  could increase this to avoid unnecessary	buffer
	      reallocation.

       YY_MALLOC(YY, SIZE)
	      The memory allocator for all parser-related storage.  The	 para-
	      meters  are  the	current	 yycontext structure and the number of
	      bytes to allocate.  The default definition is: malloc(SIZE)

       YY_REALLOC(YY, PTR, SIZE)
	      The memory reallocator for dynamically-grown  storage  (such  as
	      text  buffers and	variable stacks).  The parameters are the cur-
	      rent yycontext structure,	the previously-allocated storage,  and
	      the  number of bytes to which that storage should	be grown.  The
	      default definition is: realloc(PTR, SIZE)

       YY_FREE(YY, PTR)
	      The memory deallocator.  The parameters are the  current	yycon-
	      text structure and the storage to	deallocate.  The default defi-
	      nition is: free(PTR)

       YYRELEASE
	      The  name	 of the	function that releases all resources held by a
	      yycontext	structure.  The	default	value is 'yyrelease'.

       The following variables can be referred to within actions.

       char *yybuf
	      This variable points to the parser's input buffer	used to	 store
	      input text that has not yet been matched.

       int yypos
	      This  is	the  offset  (in  yybuf)  of  the next character to be
	      matched and consumed.

       char *yytext
	      The most recent matched text delimited by	'<' and	'>' is	stored
	      in this variable.

       int yyleng
	      This variable indicates the number of characters in 'yytext'.

       yycontext *yy
	      This  variable  points to	the instance of	'yycontext' associated
	      with the currently-active	parser.

       Programs	that wish to release  all  the	resources  associated  with  a
       parser can use the following function.

       yyrelease(yycontext*yy)
	      Returns  all  parser-allocated storage associated	with yy	to the
	      system.  The storage will	be reallocated on the next call	to yy-
	      parse().

       Note that the storage for the yycontext structure itself	is never allo-
       cated or	reclaimed implicitly.  The  application	 must  allocate	 these
       structures  in  automatic storage, or use calloc() and free() to	manage
       them explicitly.	 The example in	the following section demonstrates one
       approach	to resource management.

LEG EXAMPLE: EXTENDING THE PARSER'S CONTEXT
       The yy variable passed to actions contains the state of the parser plus
       any additional fields defined by	YY_CTX_MEMBERS.	 Theses	fields can  be
       used to store application-specific information that is global to	a par-
       ticular	call of	yyparse().  A trivial but complete leg example follows
       in which	the yycontext structure	is extended with a count of the	number
       of newline characters seen in the input so far (the  grammar  otherwise
       consumes	 and  ignores the entire input).  The caller of	yyparse() uses
       count to	print the number of lines of input that	were read.

	   %{
	   #define YY_CTX_LOCAL	1
	   #define YY_CTX_MEMBERS \
	     int count;
	   %}

	   Char	   = ('\n' | '\r\n' | '\r')	   { yy->count++ }
		   | .

	   %%

	   #include <stdio.h>
	   #include <string.h>

	   int main()
	   {
	       /* create a local parser	context	in automatic storage */
	       yycontext yy;
	       /* the context *must* be	initialised to zero before first use*/
	       memset(&yy, 0, sizeof(yy));

	       while (yyparse(&yy))
		   ;
	       printf("%d newlines\n", yy.count);

	       /* release all resources	associated with	the context */
	       yyrelease(&yy);

	       return 0;
	   }

DIAGNOSTICS
       peg and leg warn	about the  following  conditions  while	 converting  a
       grammar into a parser.

       syntax error
	      The  input grammar was malformed in some way.  The error message
	      will include the text about to be	matched	 (often	 backed	 up  a
	      huge  amount from	the actual location of the error) and the line
	      number of	the most recently considered character (which is often
	      the real location	of the problem).

       rule 'foo' used but not defined
	      The grammar referred to a	rule named 'foo' but no	definition for
	      it was given.  Attempting	 to  use  the  generated  parser  will
	      likely result in errors from the linker due to undefined symbols
	      associated with the missing rule.

       rule 'foo' defined but not used
	      The grammar defined a rule named 'foo' and then ignored it.  The
	      code  associated	with  the  rule	 is  included in the generated
	      parser which will	in all other respects be healthy.

       possible	infinite left recursion	in rule	'foo'
	      There exists at least one	path through the  grammar  that	 leads
	      from the rule 'foo' back to (a recursive invocation of) the same
	      rule without consuming any input.

       Left  recursion,	especially that	found in standards documents, is often
       'direct'	and implies trivial repetition.

	   # (6.7.6)
	   direct-abstract-declarator =
	       LPAREN abstract-declarator RPAREN
	   |   direct-abstract-declarator? LBRACKET assign-expr? RBRACKET
	   |   direct-abstract-declarator? LBRACKET STAR RBRACKET
	   |   direct-abstract-declarator? LPAREN param-type-list? RPAREN

       The recursion can easily	be eliminated by converting the	parts  of  the
       pattern following the recursion into a repeatable suffix.

	   # (6.7.6)
	   direct-abstract-declarator =
	       direct-abstract-declarator-head?
	       direct-abstract-declarator-tail*

	   direct-abstract-declarator-head =
	       LPAREN abstract-declarator RPAREN

	   direct-abstract-declarator-tail =
	       LBRACKET	assign-expr? RBRACKET
	   |   LBRACKET	STAR RBRACKET
	   |   LPAREN param-type-list? RPAREN

CAVEATS
       A  parser  that	accepts	empty input will always	succeed.  Consider the
       following example, not atypical of a first attempt to write a PEG-based
       parser:

	   Program = Expression*
	   Expression =	"whatever"
	   %%
	   int main() {
	     while (yyparse())
	       puts("success!");
	     return 0;
	   }

       This program loops forever, no matter what (if any) input  is  provided
       on  stdin.   Many  fixes	are possible, the easiest being	to insist that
       the parser always consumes some non-empty input.	  Changing  the	 first
       line to

	   Program = Expression+

       accomplishes this.  If the parser is expected to	consume	the entire in-
       put,  then  explicitly  requiring the end-of-file is also highly	recom-
       mended:

	   Program = Expression+ !.

       This works because the parser will only fail to match  ("!"  predicate)
       any  character  at all ("." expression) when it attempts	to read	beyond
       the end of the input.

BUGS
       You have	to type	'man peg' to read the manual page for leg(1).

       The 'yy'	and 'YY' prefixes cannot be changed.

       Left recursion is detected in the input grammar but is not handled cor-
       rectly in the generated parser.

       Diagnostics for errors in the input grammar are obscure and not partic-
       ularly helpful.

       The operators ! and ~ should really be named the	other way around.

       Several commonly-used lex(1) features (yywrap(),	yyin, etc.)  are  com-
       pletely absent.

       The  generated  parser  does not	contain	'#line'	directives to direct C
       compiler	errors back to the grammar description when appropriate.

SEE ALSO
       D. Val Schorre, META II,	a syntax-oriented compiler  writing  language,
       19th  ACM  National  Conference,	1964, pp. 41.301--41.311.  Describes a
       self-implementing parser	generator for analytic grammars	with no	 back-
       tracking.

       Alexander  Birman,  The	TMG  Recognition  Schema,  Ph.D. dissertation,
       Princeton, 1970.	 A mathematical	treatment of the power and  complexity
       of recursive-descent parsing with backtracking.

       Bryan  Ford, Parsing Expression Grammars: A Recognition-Based Syntactic
       Foundation, ACM SIGPLAN Symposium on  Principles	 of  Programming  Lan-
       guages,	2004.	Defines	 PEGs  and  analyses  them in relation to con-
       text-free and regular grammars.	Introduces the syntax adopted in peg.

       The standard Unix utilities lex(1) and  yacc(1)	which  influenced  the
       syntax and features of leg.

       The source code for peg and leg whose grammar parsers are written using
       themselves.

       The latest version of this software and documentation:

	   http://piumarta.com/software/peg

AUTHOR
       peg,  leg and this manual page were written by Ian Piumarta (first-name
       at last-name dot	com) while investigating the viability of regular  and
       parsing-expression  grammars for	efficiently extracting type and	signa-
       ture information	from C header files.

       Please send bug reports and suggestions for improvements	to the	author
       at the above address.

Version	0.1			September 2013				PEG(1)
Want to link to this manual page? Use this URL:
<https://man.freebsd.org/cgi/man.cgi?query=peg&sektion=1&manpath=FreeBSD+Ports+15.0>
home | help
Header And Logo

Peripheral Links

Site Navigation

FreeBSD Manual Pages

Header And Logo

Peripheral Links

Search

Site Navigation

FreeBSD Manual Pages