These are my reading notes on the following paper.

Introduction

  1. Most syntax theory is in generative systems

  2. Generative systems bridge distance between descriptions to practical recognizers.

  3. PEGs

    1. are similar to EBNF, the key difference is the prioritized choice operator.

    2. may be viewed as a formal description of a top-down parser.

    3. Can express all deterministic LR(k) languages and some non-context-free languages.

    4. Can be parsed in linear time using a tabular or memoizing parser.

Parsing Expression Grammars

  1. Syntax

    1. Single or double quotes delimit string literals.

    2. Regex like expressions that are greedy.

    3. Non-greedy behavior available through syntactic predicates.

    4. / operator is a prioritized choice

    5. & and ! delimit syntactic predicates.

      1. &e matches e, then backtracks, only indicating whether or not e matched.

      2. ! negates e

      3. Lookahead techniques.

  2. Unified grammar allows context-sensitive constructions

  3. Priorities, not ambiguities

  4. Left-recursion not availalbe, as in all top-down parsers.

Formal Development of PEGs

Most of the content in this section are simply quotes of the important formal definitions.

  1. Formal definition a 4-tuple G = (VN ,VT , R, eS ), where VN is a finite set of nonterminal symbols, VT is a finite set of terminal symbols, R is a finite set of rules, eS is a parsing expression termed the start expression, and VN ∩ VT = / 0. Each rule r ∈ R is a pair (A, e), which we write A ← e, where A ∈ VN and e is a parsing expression. For any nonterminal A, there is exactly one e such that A ← e ∈ R. R is therefore a function from nonterminals to expressions, and we write R(A) to denote the unique expression e such that A ← e ∈ R.

  2. Abstract syntax does not contain:

    1. Any character constant (.).

    2. The optional operator (?).

    3. The one-or-more operator (+).

    4. The and-predicate operator (&).

  3. Interpretation of grammar: To formalize the syntactic meaning of a grammar G = (VN ,VT, R, eS), we define a relation ⇒G from pairs of the form (e, x) to pairs of the form (n, o), where e is a parsing expression, x ∈ VT is an input string to be recognized, n ≥ 0 serves as a “step counter,” and o ∈ VT ∪ { f } indicates the result of a recognition attempt. The “output” o of a successful match is the portion of the input string recognized and “consumed,” while a distinguished sym- bol f ∈ VT indicates failure. For ((e, x), (n, o)) ∈⇒G we will write (e, x) ⇒ (n, o), with the reference to G being implied.

  4. The language L(G) of a PEG G = (V ,VT, R, eS) is the set of strings x ∈ VT for which the start expression eS matches x. Note that the start expression eS only needs to succeed on input string x for x to be included in L(G); eS need not consume all of string x.

  5. A language L over an alphabet V is a parsing expression language (PEL) iff there exists a parsing expression grammar G whose language is L.

  6. Two PEGs G1 and G2 are equivalent if they recognize the same language: L(G1) = L(G2).

  7. The equivalence of two arbitrary PEGs is undecidable.

  8. It is undecidable whether an arbitrary grammar is complete: that is, whether it either succeeds or fails on all input strings.

  9. Grammar identities—transformations made to a PEG without altering the language represented.

    1. Sequence and alternation operators are associative.

    2. Sequence operators can be distributed into choice operators on the left but not on the right.

    3. Predicates can be moved left within sequences distributively .

    4. A choice expression e1 /e2 is commutative if its subexpressions are disjoint.

Reductions on PEGs

  1. Any repetition expression e* can be eliminated by replacing it with a new nonterminal A with the definition A ← e A/ε.

  2. For any PEG G, an equivalent repetition-free grammar G’ can be created.

  3. Elminating predicates.

    1. Predicates can be eliminated from any well-formed grammar whose language does not include the empty string.

    2. Happens in three stages.

      1. Add the following nonterminals:

        1. T ← .

        2. The nonterminal Z matches and consumes any input string; to avoid introducing repetition operators, we define it Z ← T Z/ε.

        3. The nonterminal F always fails; to avoid using predi- cates we define it F ← ZT .

        4. Define a function to normalize it.

      2. Convert grammar to one in which all nonterminals either succeed and consume a nonempty input prefix or fail.

        1. Analogous to ε reduction on CFGs.

        2. Define a function g0 to split expression into ε-only and ε-free portions.

        3. If G is well-formed, g0 terminates.

      3. We define a function d, such that d(A, e) “distributes” a nonterminal A into an ε-only expression e resulting from the stage 2 function g0.

  4. PEGs can be reduced to to TS/TDPL.

  5. PEGs can be reduced to gTS/GTDPL.

Terms

generative systems

Language is defined formally by a set of rules applied recursively to generate strings of the language.

recognition-based systems

Defines a language in terms of rules or predicates that decide whether or not a given string is the language.

prioritized choice operator

Denoted by a /, used to indicate that rules should be tested in order, with the first match selected.

repetition-free grammar

Grammar whose expressin set contains only expressions constructed without the use of *.

subroutine failures

Grammars that contain undefined references.

parsing expression languages

The class of languages that cana be expressed by PEGs.

partial acceptance failures

Expressions that fail in TS and gTS because they only partailly consume input strings.

well-formed grammar

A grammar that contains no directly or mutually left-recursive rules.