|
1 | 1 | If anyone stumbles here trying to find an incremental parser, here is the 1998 paper ([Efficient and Flexible Incremental Parsing](https://www.researchgate.net/profile/SL_Graham/publication/2377179_Efficient_and_Flexible_Incremental_Parsing/links/004635294e13f23ef1000000/Efficient-and-Flexible-Incremental-Parsing.pdf
|
2 | 2 | )) that I should have found before embarking on this failed (expectations not met) project.
|
3 | 3 |
|
4 |
| -# Parsley |
5 |
| - |
6 |
| -## Parsnip |
7 |
| - |
8 |
| -Parsley has been a test bed and a proof of concept for total incremental parsers. However it suffers from severe limitations (mainly revolving around lookaheads, both at the lexeme and production level) which hinder further development and acceptance. |
9 |
| - |
10 |
| -Further development of the concepts and techniques explored in Parsley will occur in [Parsnip](https://github.com/cgrand/parsnip/). |
11 |
| - |
12 |
| -## Introduction |
13 |
| - |
14 |
| -Parsley generates *total and truly incremental parsers*. |
15 |
| - |
16 |
| -Total: a Parsley parser *yields a parse-tree for any input string*. |
17 |
| - |
18 |
| -Truly incremental: a Parsley parser can operate as a text buffer, in best cases |
19 |
| -recomputing the parse-tree after a sequence of edits happens in *logarithmic |
20 |
| -time* (worst case: it behaves like a restartable parser). |
21 |
| - |
22 |
| -Parsley parsers have *no separate lexer*; this allows for better compositionality
23 |
| -of grammars. |
24 |
| - |
25 |
| -For now Parsley uses the same technique (for lexer-less parsing) as described |
26 |
| -in this paper: |
27 |
| -Context-Aware Scanning for Parsing Extensible Languages |
28 |
| -http://www.umsec.umn.edu/publications/Context-Aware-Scanning-Parsing-Extensible-Language |
29 |
| - |
30 |
| -(I independently rediscovered this technique and dubbed it LR+.) |
31 |
| - |
32 |
| -Without a separate lexer, a language is entirely defined by its grammar. |
33 |
| -A grammar is an alternation of keywords (non-terminal names) and other values. |
34 |
| -A keyword and another value form a production rule. |
35 |
| - |
36 |
| - |
37 |
| -## Specifying grammars |
38 |
| - |
39 |
| -A simple grammar is: |
40 |
| - |
41 |
| - :expr #{"x" ["(" :expr* ")"]} |
42 |
| - |
43 |
| -`x` `()` `(xx)` `((x)())` are recognized by this grammar. |
44 |
| - |
45 |
| -By default the main production of a grammar is the first one. |
46 |
| - |
47 |
| -A production right value is a combination of: |
48 |
| - |
49 |
| -* strings and regexes (terminals -- the set of terminal types is broader and |
50 |
| - even open, more later) |
51 |
| -* keywords (non-terminals) which can be suffixed by `*`, `+` or `?` to denote |
52 |
| - repetitions or options. |
53 |
| -* sets to denote an alternative |
54 |
| -* vectors to denote a sequence. Inside vectors `:*`, `:+` and `:?` are postfix unary |
55 |
| - operators. That is `["ab" :+]` denotes a non-empty repetition of the `ab` |
56 |
| - string |
57 |
| - |
58 |
| -A production left value is always a keyword. If this keyword is suffixed by `-`, |
59 |
| -no node will be generated in the parse-tree for this rule, its child nodes are |
60 |
| -inlined in the parent node. Rules with such names are called anonymous rules. |
61 |
| -An anonymous rule must be referred to by its base name (without the `-`). |
62 |
| - |
63 |
| -These two grammars specify the same language but the resulting parse-trees will |
64 |
| -be different (additional `:expr-rep` nodes): |
65 |
| - |
66 |
| - :expr #{"x" ["(" :expr* ")"]} |
67 |
| - |
68 |
| - :expr #{"x" :expr-rep} |
69 |
| - :expr-rep ["(" :expr* ")"] |
70 |
| - |
71 |
| -These two grammars specify the same language and the same parse-trees: |
72 |
| - |
73 |
| - :expr #{"x" ["(" :expr* ")"]} |
74 |
| - |
75 |
| - :expr #{"x" :expr-rep} |
76 |
| - :expr-rep- ["(" :expr* ")"] |
77 |
| - |
78 |
| -## Creating parsers |
79 |
| - |
80 |
| -A parser is created using the `parser` or `make-parser` functions. |
81 |
| - |
82 |
| - (require '[net.cgrand.parsley :as p]) |
83 |
| - (def p (p/parser :expr #{"x" ["(" :expr* ")"]})) |
84 |
| - (pprint (p "(x(x))")) |
85 |
| - |
86 |
| - {:tag :net.cgrand.parsley/root, |
87 |
| - :content |
88 |
| - [{:tag :expr, |
89 |
| - :content |
90 |
| - ["(" |
91 |
| - {:tag :expr, :content ["x"]} |
92 |
| - {:tag :expr, :content ["(" {:tag :expr, :content ["x"]} ")"]} |
93 |
| - ")"]}]} |
94 |
| - |
95 |
| - ; running on malformed input with garbage |
96 |
| - (pprint (p "a(zldxn(dez)")) |
97 |
| - |
98 |
| - {:tag :net.cgrand.parsley/unfinished, |
99 |
| - :content |
100 |
| - [{:tag :net.cgrand.parsley/unexpected, :content ["a"]} |
101 |
| - {:tag :net.cgrand.parsley/unfinished, |
102 |
| - :content |
103 |
| - ["(" |
104 |
| - {:tag :net.cgrand.parsley/unexpected, :content ["zld"]} |
105 |
| - {:tag :expr, :content ["x"]} |
106 |
| - {:tag :net.cgrand.parsley/unexpected, :content ["n"]} |
107 |
| - {:tag :expr, |
108 |
| - :content |
109 |
| - ["(" |
110 |
| - {:tag :net.cgrand.parsley/unexpected, :content ["dez"]} |
111 |
| - ")"]}]}]} |
112 |
| - |
113 |
| - |
114 |
| -## Creating buffers |
115 |
| - |
116 |
| -Creating a buffer, editing it and getting its resulting parse-tree: |
117 |
| - |
118 |
| - (-> p p/incremental-buffer (p/edit 0 0 "(") (p/edit 1 0 "(x)") p/parse-tree pprint) |
119 |
| - |
120 |
| - {:tag :net.cgrand.parsley/unfinished, |
121 |
| - :content |
122 |
| - [{:tag :net.cgrand.parsley/unfinished, |
123 |
| - :content |
124 |
| - ["(" |
125 |
| - {:tag :expr, :content ["(" {:tag :expr, :content ["x"]} ")"]}]}]} |
126 |
| - |
127 |
| -Incremental parsing at work: |
128 |
| - |
129 |
| - => (def p (p/parser :expr #{"x" "\n" ["(" :expr* ")"]})) |
130 |
| - #'net.cgrand.parsley/p |
131 |
| - => (let [line (apply str "\n" (repeat 10 "((x))")) |
132 |
| - input (str "(" (apply str (repeat 1000 line)) ")") |
133 |
| - buf (p/incremental-buffer p) |
134 |
| - buf (p/edit buf 0 0 input)] |
135 |
| - (time (p/parse-tree buf)) |
136 |
| - (time (p/parse-tree (-> buf (p/edit 2 0 "(") (p/edit 51002 0 ")")))) |
137 |
| - nil) |
138 |
| - "Elapsed time: 508.834 msecs" |
139 |
| - "Elapsed time: 86.038 msecs" |
140 |
| - nil |
141 |
| - |
142 |
| -Hence, *reparsing the buffer only took a fraction of the original time* despite |
143 |
| -the buffer having been modified at the start and at the end. |
144 |
| - |
145 |
| -## Incremental parsing |
146 |
| - |
147 |
| -The input string is split into _chunks_ (lines by default) and chunks are always |
148 |
| -reparsed as a whole, so don't experiment with incremental parsing with 1-line |
149 |
| -inputs! |
150 |
| - |
151 |
| -Let's look at a slightly more complex example: |
152 |
| - |
153 |
| - => (def p (p/parser {:main :expr* |
154 |
| - :space :ws? |
155 |
| - :make-node (fn [tag content] {:tag tag :content content :id (gensym)})} |
156 |
| - :ws #"\s+" |
157 |
| - :expr #{#"\w+" ["(" :expr* ")"]})) |
158 |
| - |
159 |
| -This example introduces the option map: if the first arg to `parser` is a map |
160 |
| -(instead of a keyword), it's a map of options. See Options for more. |
161 |
| - |
162 |
| -The important option here is that we redefine how nodes of the parse-tree are |
163 |
| -constructed (via the `make-node` option). We add a unique identifier to each |
164 |
| -node. |
165 |
| - |
166 |
| -Now let's create a 3-line input and parse it: |
167 |
| - |
168 |
| - => (def buf (-> p incremental-buffer (edit 0 0 "((a)\n(b)\n(c))"))) |
169 |
| - => (-> buf parse-tree pprint) |
170 |
| - nil |
171 |
| - {:tag :net.cgrand.parsley/root, |
172 |
| - :content |
173 |
| - [{:tag :expr, |
174 |
| - :content |
175 |
| - ["(" |
176 |
| - {:tag :expr, |
177 |
| - :content ["(" {:tag :expr, :content ["a"], :id G__1806} ")"], |
178 |
| - :id G__1807} |
179 |
| - {:tag :ws, :content ["\n"], :id G__1808} |
180 |
| - {:tag :expr, |
181 |
| - :content ["(" {:tag :expr, :content ["b"], :id G__1809} ")"], |
182 |
| - :id G__1810} |
183 |
| - {:tag :ws, :content ["\n"], :id G__1811} |
184 |
| - {:tag :expr, |
185 |
| - :content ["(" {:tag :expr, :content ["c"], :id G__1812} ")"], |
186 |
| - :id G__1813} |
187 |
| - ")"], |
188 |
| - :id G__1814}], |
189 |
| - :id G__1815} |
190 |
| - |
191 |
| -Now, let's change the "b" into "BOO" and parse the buffer again: |
192 |
| - |
193 |
| - => (-> buf (edit 6 1 "BOO") parse-tree pprint) |
194 |
| - nil |
195 |
| - {:tag :net.cgrand.parsley/root, |
196 |
| - :content |
197 |
| - [{:tag :expr, |
198 |
| - :content |
199 |
| - ["(" |
200 |
| - {:tag :expr, |
201 |
| - :content ["(" {:tag :expr, :content ["a"], :id G__1806} ")"], |
202 |
| - :id G__1807} |
203 |
| - {:tag :ws, :content ["\n"], :id G__1818} |
204 |
| - {:tag :expr, |
205 |
| - :content ["(" {:tag :expr, :content ["BOO"], :id G__1819} ")"], |
206 |
| - :id G__1820} |
207 |
| - {:tag :ws, :content ["\n"], :id G__1811} |
208 |
| - {:tag :expr, |
209 |
| - :content ["(" {:tag :expr, :content ["c"], :id G__1812} ")"], |
210 |
| - :id G__1813} |
211 |
| - ")"], |
212 |
| - :id G__1821}], |
213 |
| - :id G__1822} |
214 |
| ------ |
215 |
| - |
216 |
| -We can spot that 5 out of the 10 nodes are shared with the previous parse-tree. |
217 |
| - |
218 |
| - |
219 |
| -## Options |
220 |
| - |
221 |
| -`:main` specifies the root production, by default this is the first
222 |
| -production of the grammar. |
223 |
| - |
224 |
| -`:root-tag` specifies the tag name to use for the root node |
225 |
| -(`:net.cgrand.parsley/root` by default). |
226 |
| - |
227 |
| -`:space` specifies a production which will be interspersed between every symbol |
228 |
| -(terminal or not) *except in a sequence created with `unspaced`.* |
229 |
| - |
230 |
| -`:make-node` specifies a function whose arglist is `[tag children-vec]` which |
231 |
| -returns a new node. By default it creates instances of the Node record with keys `tag`
232 |
| -and `content`. |
233 |
| - |
234 |
| -`:make-unexpected` specifies a 1-arg function which converts a string (of |
235 |
| -unexpected characters) to a node. By default it delegates to `:make-node`. |
236 |
| - |
237 |
| -`:make-leaf` specifies a 1-arg function which converts a string (token) to a |
238 |
| -node, by default behaves like identity. |
| 4 | +If you are still interested in Parsley, go read the old [README](DONTREADME.md) |