These are my reading notes on the following paper.
Introduction
-
Most syntax theory is in generative systems
-
Generative systems bridge distance between descriptions to practical recognizers.
-
PEGs
-
are similar to EBNF, the key difference is the prioritized choice operator.
-
may be viewed as a formal description of a top-down parser.
-
Can express all deterministic LR(k) languages and some non-context-free languages.
-
Can be parsed in linear time using a tabular or memoizing parser.
-
Parsing Expression Grammars
-
Syntax
-
Single or double quotes delimit string literals.
-
Regex like expressions that are greedy.
-
Non-greedy behavior available through syntactic predicates.
-
/ operator is a prioritized choice
-
& and ! delimit syntactic predicates.
-
&e matches e, then backtracks, only indicating whether or not e matched.
-
! negates e
-
Lookahead techniques.
-
-
-
Unified grammar allows context-sensitive constructions
-
Priorities, not ambiguities
-
Left-recursion not availalbe, as in all top-down parsers.
Formal Development of PEGs
Most of the content in this section are simply quotes of the important formal definitions.
-
Formal definition a 4-tuple G = (VN ,VT , R, eS ), where VN is a finite set of nonterminal symbols, VT is a finite set of terminal symbols, R is a finite set of rules, eS is a parsing expression termed the start expression, and VN ∩ VT = / 0. Each rule r ∈ R is a pair (A, e), which we write A ← e, where A ∈ VN and e is a parsing expression. For any nonterminal A, there is exactly one e such that A ← e ∈ R. R is therefore a function from nonterminals to expressions, and we write R(A) to denote the unique expression e such that A ← e ∈ R.
-
Abstract syntax does not contain:
-
Any character constant (.).
-
The optional operator (?).
-
The one-or-more operator (+).
-
The and-predicate operator (&).
-
-
Interpretation of grammar: To formalize the syntactic meaning of a grammar G = (VN ,VT, R, eS), we define a relation ⇒G from pairs of the form (e, x) to pairs of the form (n, o), where e is a parsing expression, x ∈ VT is an input string to be recognized, n ≥ 0 serves as a “step counter,” and o ∈ VT ∪ { f } indicates the result of a recognition attempt. The “output” o of a successful match is the portion of the input string recognized and “consumed,” while a distinguished sym- bol f ∈ VT indicates failure. For ((e, x), (n, o)) ∈⇒G we will write (e, x) ⇒ (n, o), with the reference to G being implied.
-
The language L(G) of a PEG G = (V ,VT, R, eS) is the set of strings x ∈ VT for which the start expression eS matches x. Note that the start expression eS only needs to succeed on input string x for x to be included in L(G); eS need not consume all of string x.
-
A language L over an alphabet V is a parsing expression language (PEL) iff there exists a parsing expression grammar G whose language is L.
-
Two PEGs G1 and G2 are equivalent if they recognize the same language: L(G1) = L(G2).
-
The equivalence of two arbitrary PEGs is undecidable.
-
It is undecidable whether an arbitrary grammar is complete: that is, whether it either succeeds or fails on all input strings.
-
Grammar identities—transformations made to a PEG without altering the language represented.
-
Sequence and alternation operators are associative.
-
Sequence operators can be distributed into choice operators on the left but not on the right.
-
Predicates can be moved left within sequences distributively .
-
A choice expression e1 /e2 is commutative if its subexpressions are disjoint.
-
Reductions on PEGs
-
Any repetition expression e* can be eliminated by replacing it with a new nonterminal A with the definition A ← e A/ε.
-
For any PEG G, an equivalent repetition-free grammar G’ can be created.
-
Elminating predicates.
-
Predicates can be eliminated from any well-formed grammar whose language does not include the empty string.
-
Happens in three stages.
-
Add the following nonterminals:
-
T ← .
-
The nonterminal Z matches and consumes any input string; to avoid introducing repetition operators, we define it Z ← T Z/ε.
-
The nonterminal F always fails; to avoid using predi- cates we define it F ← ZT .
-
Define a function to normalize it.
-
-
Convert grammar to one in which all nonterminals either succeed and consume a nonempty input prefix or fail.
-
Analogous to ε reduction on CFGs.
-
Define a function g0 to split expression into ε-only and ε-free portions.
-
If G is well-formed, g0 terminates.
-
-
We define a function d, such that d(A, e) “distributes” a nonterminal A into an ε-only expression e resulting from the stage 2 function g0.
-
-
-
PEGs can be reduced to to TS/TDPL.
-
PEGs can be reduced to gTS/GTDPL.
Terms
- generative systems
-
Language is defined formally by a set of rules applied recursively to generate strings of the language.
- recognition-based systems
-
Defines a language in terms of rules or predicates that decide whether or not a given string is the language.
- prioritized choice operator
-
Denoted by a /, used to indicate that rules should be tested in order, with the first match selected.
- repetition-free grammar
-
Grammar whose expressin set contains only expressions constructed without the use of *.
- subroutine failures
-
Grammars that contain undefined references.
- parsing expression languages
-
The class of languages that cana be expressed by PEGs.
- partial acceptance failures
-
Expressions that fail in TS and gTS because they only partailly consume input strings.
- well-formed grammar
-
A grammar that contains no directly or mutually left-recursive rules.