-
Notifications
You must be signed in to change notification settings - Fork 90
/
Copy pathPARSER_DOC
311 lines (265 loc) · 14.6 KB
/
PARSER_DOC
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
PARSER_DOC written by Mitchell Foral
Overview:
I will assume the reader has a decent knowledge of how Ragel works and the
Ragel syntax. If not, please review the Ragel manual found at:
http://research.cs.queensu.ca/~thurston/ragel/
All parsers must at least:
* Call a callback function when a line of code is parsed.
* Call a callback function when a line of comment is parsed.
* Call a callback function when a blank line is parsed.
Additionally a parser can call the callback function for each position of
entities parsed.
Take a look at c.rl and even keep it open for reference when reading this
document to better understand how parsers work and how to write one.
Writing a Parser:
First create your parser in ext/ohcount_native/ragel_parsers/. Its name
should be the language you are parsing with a '.rl' extension. You will not
have to manually compile any parsers, as the Rakefile does this automatically
for you. Every parser must have the following at the top:
/************************* Required for every parser *************************/
#ifndef RAGEL_C_PARSER
#define RAGEL_C_PARSER
#include "ragel_parser_macros.h"
// the name of the language
const char *C_LANG = "c";
// the languages entities
const char *c_entities[] = {
"space", "comment", "string", "number", "preproc",
"keyword", "identifier", "operator", "any"
};
// constants associated with the entities
enum {
C_SPACE = 0, C_COMMENT, C_STRING, C_NUMBER, C_PREPROC,
C_KEYWORD, C_IDENTIFIER, C_OPERATOR, C_ANY
};
/*****************************************************************************/
And the following at the bottom:
/************************* Required for every parser *************************/
/* Parses a string buffer with C/C++ code.
*
* @param *buffer The string to parse.
* @param length The length of the string to parse.
* @param count Integer flag specifying whether or not to count lines. If yes,
* uses the Ragel machine optimized for counting. Otherwise uses the Ragel
* machine optimized for returning entity positions.
* @param *callback Callback function. If count is set, callback is called for
* every line of code, comment, or blank with 'lcode', 'lcomment', and
* 'lblank' respectively. Otherwise callback is called for each entity found.
*/
void parse_c(char *buffer, int length, int count,
void (*callback) (const char *lang, const char *entity, int start, int end)
) {
init
%% write init;
cs = (count) ? c_en_c_line : c_en_c_entity;
%% write exec;
// if no newline at EOF; callback contents of last line
if (count) { process_last_line(C_LANG) }
}
#endif
/*****************************************************************************/
(Your parser will go between these two blocks.)
The code can be found in the existing c.rl parser. You will need to change:
* RAGEL_[lang]_PARSER - Replace [lang] with your language name. So if you
are writing a C parser, it would be RAGEL_C_PARSER.
* [lang]_LANG - Set the variable name to be [lang]_LANG and its value to be
the name of your language to parse. [lang] is your language name. For C it
would be C_LANG.
* [lang]_entities - Set the variable name to be [lang]_entities (e.g.
c_entries) The value is an array of string entities your language has.
For example C has comment, string, number, etc. entities. You should
definately have "space", and "any" entities. "any" entities are typically
used for entity machines (discussed later) and match any character that
is not recognized so the parser does not do something unpredictable.
* enum - Change the value of the enum to correspond with your entities. So
if in your parser you look up [lang]_entities[ENTITY], you will get the
associated entity's string name.
* parse_[lang] - Set the function name to parse_[lang] where again, [lang]
is the name of your language. In the case of C, it is parse_c.
* [lang]_en_[lang]_line - The line counting machine.
* [lang]_en_[lang]_entity - The entity machine.
You may be asking why you have to rename variables and functions. Well if
variables have the same name in header files (which is what parsers are),
the compiler complains. Also, when you have languages embedded inside each
other, any identifiers with the same name can easily be mixed up. It is also
important to prefix your Ragel definitions with your language to avoid
conflicts with other parsers.
Additional variables available to parsers are in the "ragel_parser_macros.h"
file. Take a look at it and try to understand what the variables are used for.
They will make more sense later on.
Now you can define your Ragel parser. Name your machine after your language,
'write data', and include 'common.rl', a file with common Ragel definitions,
actions, etc. For example:
%%{
machine c;
write data;
include "common.rl";
...
}%%
Before you begin to write patterns for each entity in your language, you need
to understand how the parser should work.
Each parser has two machines: one optimized for counting lines of code,
comments, and blanks; the other for identifying entity positions in the
buffer.
Line Counting Machine:
This machine should be written as a line-by-line parser for multiple lines.
This means you match any combination of entities except a newline up until
you do reach a newline. If the line contains only spaces, or nothing at all,
it is blank. If the line contains spaces at first, but then a comment, or
just simply a comment, the line is a comment. If the line contains anything
but a comment after spaces (if there are any), it is a line of code. You
will do this using a Ragel scanner.
The callback function will be called for each line parsed.
Scanner Parser Structure:
A scanner parser will look like this:
[lang]_line := |*
entity1 ${ entity = ENTITY1; } => [lang]_ccallback;
entity2 ${ entity = ENTITY2; } => [lang]_ccallback;
...
entityn ${ entity = ENTITYN; } => [lang]_ccallback;
*|;
(As usual, replace [lang] with your language name.)
Each entity is the pattern for an entity to match, the last one typically
being the newline entity. For each match, the variable is set to a
constant defined in the enum, and the main action is called (you will need
to create this action above the scanner).
When you detect whether or not a line is code or comment, you should call
the appropriate 'code' or 'comment' action defined in common.rl as soon
as possible. It is not necessary to worry about whether or not these
actions are called more than once for a given line; the first call to
either sets the status of the line permanently. Sometimes you cannot call
'code' or 'comment' for one reason or another. Do not worry, as this is
discussed later.
When you reach a newline, you will need to decide whether the current line
is a line of code, comment, or blank. This is easy. Simply check if the
line_contains_code or whole_line_comment variables are set to 1. If
neither of them are, the line is blank. Then call the callback function
(not action) with an "lcode", "lcomment", or "lblank" string, and the
start and end positions of that line (including the newline). The start
position of the line is in the line_start variable. It should be set at
the beginning of every line either through the 'code' or 'comment'
actions, or manually in the main action. Finally the line_contains_code,
whole_line_comment, and line_start state variables must be reset. All this
should be done within the main action shown below.
Note: For most parsers, the std_newline(lang) macro is sufficient and does
everything in the main action mentioned above. The lang parameter is the
[lang]_LANG string.
Main Action Structure:
The main action looks like this:
action [lang]_ccallback {
switch(entity) {
when ENTITY1:
...
break;
when ENTITY2:
...
break;
...
when ENTITYN:
...
break;
}
}
Defining Patterns for Entities:
Now it is time to write patterns for each entity in your language. That
does not seem very hard, except when your entity can cover multiple lines.
Comments and strings in particular can do this. To make an accurate line
counter, you will need to count the lines covered by multi-line entities.
When you detect a newline inside your multi-line entity, you should set
the entity variable to be INTERNAL_NL (-2) and call the main action. The
main action should have a case for INTERNAL_NL separate from the newline
entity. In it, you will check if the current line is code or comment and
call the callback function with the appropriate string ("lcode" or
"lcomment") and beginning and end of the line (including the newline).
Afterwards, you will reset the line_contains_code and whole_line_comment
state variables. Then set the line_start variable to be p, the current
Ragel buffer position. Because line_contains_code and whole_line_comment
have been reset, any non-newline and non-space character in the multi-line
pattern should set line_contains_code or whole_line_comment back to 1.
Otherwise you would count the line as blank.
Note: For most parsers, the std_internal_newline(lang) macro is sufficient
and does everything in the main action mentioned above. The lang parameter
is the [lang]_LANG string.
For multi-line matches, it is important to call the 'code' or 'comment'
actions (mentioned earlier) before an internal newline is detected so the
line_contains_code and whole_line_comment variables are properly set. For
other entities, you can use the 'code' macro inside the main action which
executes the same code as the Ragel 'code' action. Other C macros are
'comment' and 'ls', the latter is typically used for the SPACE entity when
defining line_start.
Also for multi-line matches, it may be necessary to use the 'enqueue' and
'commit' actions. If it is possible that a multi-line entity will not have
an ending delimiter (for example a string), use the 'enqueue' action as
soon as the start delimitter has been detected, and the 'commit' action as
soon as the end delimitter has been detected. This will eliminate the
potential for any counting errors.
Notes:
* You can be a bit sloppy with the line counting machine. For example the
only C entities that can contain newlines are strings and comments, so
INTERNAL_NL would only be necessary inside them. Other than those,
anything other than spaces is considered code, so do not waste your time
defining specific patterns for other entities.
Parsers with Embedded Languages:
Notation: [lang] is the parent language, [elang] is the embedded language.
To write a parser with embedded languages (such as HTML with embedded CSS
and Javascript), you should first #include the parser(s) above your Ragel
code. The header file is "[elang]_parser.h".
Next, after the inclusion of 'common.rl', add '#EMBED([elang])' on
separate lines for each embedded language. The Rakefile looks for these
special comments to embed the language for you automatically.
In your main action, you need to add another entity CHECK_BLANK_ENTRY. It
should call the 'check_blank_entry([lang]_LANG)' macro. Blank entries are
an entry into an embedded language, but the rest of the line is blank
before a newline. For example, a CSS entry in HTML is something like:
<style type="text/css">
If there is no CSS code after the entry (a blank entry), the line should
be counted as HTML code, and the 'check_blank_entry' macro handles this.
But you may be asking, "how do I get to the CHECK_BLANK_ENTRY entity?".
This will be discussed in just a bit.
Also use the emb_newline and emb_internal_newline macros instead of the
std_newline and std_internal_newline macros.
For each embedded language you will have to define an entry and outry. An
entry is the pattern that transitions from the parent language into the
child language. An outry is the pattern from child to parent. You will
need to put your entries in your [lang]_line machine. You will also need
to re-create each embedded language's line machine (define as
[lang]_[elang]_line; e.g. html_css_line) and put outry patterns in those.
Entries typically would be defined as [lang]_[elang]_entry, and outries
as [lang]_[elang]_outry.
Note: An outry should have a 'check_blank_outry' action so the line is not
mistakenly counted as a line of embedded language code if it is actually a
line of parent code.
Entry pattern actions should be:
[lang]_[elang]_entry @{ entity = CHECK_BLANK_ENTRY; } @[lang]_callback
@{ saw([elang]_LANG)} => { fcall [lang]_[elang]_line; };
What this does is checks for a blank entry, and if it is, counts the line
as a line of parent language code. If it is not, the macro will not do
anything. The machine then transitions into the child language.
Outry pattern actions should be:
@{ p = ts; fret; };
What this does is sets the current Ragel parser position to the beginning
of the outry so the line is counted as a line of parent language code if
no child code is on the same line. The machine then transitions into the
parent language.
Entity Identifying Machine:
This machine does not have to be written as a line-by-line parser. It only
has to identify the positions of language entities, such as whitespace,
comments, strings, etc. in sequence. As a result they can be written much
faster and more easily with less thought than a line counter. Using a
scanner is most efficient.
The callback function will be called for each entity parsed.
Scanner Structure:
[lang]_entity := |*
entity1 ${ entity = ENTITY1; } => [lang]_ecallback;
entity2 ${ entity = ENTITY2; } => [lang]_ecallback;
...
entityn ${ entity = ENTITYN; } => [lang]_ecallback;
*|;
Main Action Structure:
action [lang]_ecallback {
callback([lang]_LANG, [lang]_entities[entity], cint(ts), cint(te));
}
Note: the 'ls', 'code', 'comment', 'queue' and 'commit' actions are
completely unnecessary.
Parsers for Embedded Languages:
TODO: