-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathmsta.sgml.in
1429 lines (1269 loc) · 55.6 KB
/
msta.sgml.in
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<!doctype linuxdoc system>
<article>
<!-- Title information -->
<title>MSTA (syntax description translator)
<author>Vladimir Makarov, <tt>[email protected]</tt>
<date>May 5, 1999
<abstract>
This document describes MSTA (translator of syntax description of a language
into code for parsing programs on the language).
</abstract>
<toc>
<sect>Introduction
<p>
MSTA is syntax description translator analogous to YACC. Although
MSTA can fully emulate YACC, MSTA is aimed to solve some drawbacks of
YACC. Therefore MSTA has the following additional features:
<itemize>
<item>Fast LR(k) and LALR(k) grammars (with possibility resolution of
conflicts). Look ahead of only necessary depth (not necessary
given k). Originally LALR(k) parsers are generated by modified
fast DeRemer's algorithm. Parsers generated by MSTA are up to 50%
faster than ones generated by BISON and BYACC but usually have
bigger size.
<item>Extended Backus-Naur Form (EBNF), and constructions for more
convenient description of the scanners. More convenient
naming attributes.
<item>Optimizations (extracting LALR- and regular parts of grammars and
implementing parsing them by adequate methods) which permit to
use MSTA for generation of effective lexical analyzers. As
consequence MSTA permits to describe easily (by CFG) scanners
which can not be described by regular expressions (i.e. nested
comments).
<item>More safe error recovery and reporting (the 1st additional
error recovery method besides error recovery method of YACC).
<item>A minimal error recovery and reporting (the 2nd additional
error recovery method besides error recovery method of YACC).
<item>Fast generation of fast parsers.
</itemize>
<sect>MSTA description language
<p>
MSTA description languages is superset of YACC language. The major
additional features are Extended Backus Naur Form (EBNF) for more
convenient descriptions of languages, additional constructions in
rules for more convenient description of scanners, and named
attributes.
<sect1>Layout of MSTA description
<p>
MSTA description structure has the following layout which
is similar to one of YACC file.
<tscreen><verb>
DECLARATIONS
%%
RULES
%%
ADDITIONAL C/C++ CODE
</verb></tscreen>
The `%%' serves to separate the sections of description. All sections
are optional. The first `%%' starts section of keywords and is
obligatory even if the section is empty, the second `%%' may be absent
if section of additional C/C++ code is absent too.
Full YACC syntax of MSTA description file is placed in Appendix 1.
<sect1>Declarations
<p>
The section of declarations may contain the following construction:
<tscreen><verb>
%start identifier
</verb></tscreen>
which determines axiom of the grammar. If such construction is
absent, the axiom is believed to be nonterminal in the left hand side
of the first rule. If there are several such construction, all ones
except for the first are ignored.
By default, the values of attributes of the terminals (tokens) and
nonterminals shall be integers. If you are going to use the values of
different types, you shall use
<tscreen><verb>
<tag>
</verb></tscreen>
in constructs declaring symbols (%token, %type, %left, ...) and shall
insert corresponding union member names in the following construction:
<tscreen><verb>
%union { body of union in C/C++ }
</verb></tscreen>
Alternatively, the union can be declared in interface file, and a
typedef used to define the symbol YYSTYPE (see generated code) to
represent this union. The effect of %union is to provide the
declaration of YYSTYPE directly from the input.
There is group of the following declarators which take token
(terminal) or nonterminal names as arguments.
<tscreen><verb>
%token [<tag>] name [number] [name [number]]...
%left [<tag>] name [number] [name [number]]...
%right [<tag>] name [number] [name [number]]...
%nonassoc [<tag>] name [number] [name [number]]...
%type <tag> name...
</verb></tscreen>
The names can optionally be preceded by the name of a C/C++ union
member (called a tag see above) appearing within ``<'' and ``>''. The
use of tag specifies that the tokens or nonterminals named in this
construction are to be of the same C/C++ type as the union member
referenced by the tag.
If symbol used in grammar is undefined by a %token, %left, %right, or
%nonassoc declaration, the symbol will be considered as a nonterminal.
The first occurrence of a given token can be followed by a positive
integer in constructions `%token', `%left', `%right', and `%nonassoc'
defining tokens. In this case the value assigned to it shall be code
of the corresponding token returned by scanner.
Constructions `%left', `%right', and `%nonassoc' assign precedence and
to the corresponding tokens. All tokens in the same construction have
the same precedence level and associativity; the constructions is
suggested to be placed in order of increasing precedence.
Construction `%left' denotes that the operators (tokens) in that
construction are left associative, and construction `%right' similarly
denotes right associative operators.
Construction `%nonassoc' means that tokens cannot be used
associatively. If the parser encounters associative use of this token
it will report an error.
The construction `%type' means that the attributes of the
corresponding nonterminals are of type given in the tag field.
Once the type, precedence, or token number of a symbol is specified,
it shall not be changed. If the first declaration of a token does not
assign a token number, MSTA will assign a token number. Once this
assignment is made, the token number shall not be changed by explicit
assignment.
Usually real grammars can not be declared without shift/reduce
conflicts. To control suggested number of shift/reduce conflicts, the
following construction can be used.
<tscreen><verb>
%expect number
</verb></tscreen>
If such construction is present, MSTA will report error if the number
of shift/reduce conflicts is not the same as one in the construction.
Remember that it is not standard YACC construction.
The following construction in declarations means that the scanner
should be generated.
<tscreen><verb>
%scanner
</verb></tscreen>
There are the following major differences in parser and scanner
generated by MSTA
<itemize>
<item>In order to use a MSTA generated parser with a MSTA generated
scanner, all objects in a MSTA scanner (variables, types, macro,
and so on) are named by adding letter `s' or `S' after prefixes
`yy' or `YY'.
<item>Additional function `yylex_start' is generated. The function
should be used for initiation of scanner (see Generated code).
<item>Function `yylex' is generated instead of function `yyparse'.
This function can be called many times for getting next token.
Code of the next token is suggested to returned by statements
`return' in the actions. Input stream (look ahead characters)
is saved from a call of `yylex' to the next its call.
<item>Instead of function `yylex' a function `yyslex' is
used to read the next character (token in terminology of MSTA
specification file) from the input stream. -1 is used as the
end of file instead of 0 because scanner must read and process
zero characters.
<item>Macro `YYSABORT' is -1 in order to differ token code from
flag of finishing work of the scanner. Remember that analogous
macro `YYABORT' for MSTA parser is 1.
<item>To extract all regular parts in the scanner grammar,
splitting LR-sets is fulfilled (see MSTA implementation).
</itemize>
You can look at a scanner specification in Appendix 2.
There may be also the following constructions in the declaration
section
<tscreen><verb>
%{
C/C++ DECLARATIONS
%}
%local {
C/C++ DECLARATIONS
}
%import {
C/C++ DECLARATION
}
and
%export {
C/C++ DECLARATION
}
</verb></tscreen>
which contain any C/C++ declarations (types, variables, macros, and so
on) used in sections. Remember the only first construction is
standard POSIX YACC construction.
The local C/C++ declarations are inserted at the begin of generated
implementation file (see section `generated code') but after
include-directive of interface file (if present -- see MSTA Usage).
You also can use more traditional construction of YACC %{ ... %}
instead.
C/C++ declarations which start with `%import' are inserted at the
begin of generated interface file. If the interface file is not
generated, the code is inserted at the begin of the part of
implementation file which would correspond the interface file.
C/C++ declarations which start with `%export' are inserted at the end
of generated interface file. For example, such exported C/C++ code
may contain definitions of external variables and functions which
refer to definitions generated by MSTA. If the interface file is
not generated, the code is inserted at the end of the part of
implementation file which would correspond the interface file.
All C/C++ declarations are placed in the same order as in the section
of declarations.
<sect1>Rules
<p>
The section of declarations is followed by section of rules.
The rules section defines the context-free grammar to be accepted by
the function yacc generates, and associates with those rules C
language actions and additional precedence information. The grammar
is described below, and a formal definition follows.
The rules section contains one or more grammar rules. A grammar rule
has the following form:
<tscreen><verb>
nonterminal : pattern ;
</verb></tscreen>
The nonterminal in the left side hand of the rule describes a language
construction and pattern into which the nonterminal is derivated. The
semicolon at the end of the rule can be absent.
MSTA can use EBNF (Extended Backus-Naur Form) to describe the
patterns. Because the pattern can be quite complex, MSTA internally
transforms rules in the description into simple rules and assigns a
unique number to each simple rule. Simple rule can contains only
sequence of nonterminals and tokens. Simple rules and the numbers
assigned to the rules appear in the description file (see MSTA usage).
To achieve to the simple rules, MSTA makes the following
transformations (in the same order).
<enum>
<item>Alternatives
<tscreen><verb>
nonterminal : pattern1 | pattern2
</verb></tscreen>
are transformed into
<tscreen><verb>
nonterminal : pattern1
nonterminal : pattern2
</verb></tscreen>
<item>Lists
<tscreen><verb>
nonterminal : ... pattern / s_pattern ...
</verb></tscreen>
are transformed into
<tscreen><verb>
nonterminal : ... N ...
N : N s_patter pattern
</verb></tscreen>
N denotes here a new nonterminal created during the
transformation. This construction is very convenient for
description of lists with separators, e.g. identifier separated
by commas. Remember that the lists are not feature of standard
POSIX YACC.
<p>
<item>Naming
<tscreen><verb>
nonterminal : ... N @ identifier ...
</verb></tscreen>
is transformed into
<tscreen><verb>
nonterminal : ... N ...
</verb></tscreen>
Here N denotes a nonterminal, a token, or the following
constructions. Instead of number in actions, the identifier
can be used for naming attributes of the nonterminal, the
token, or nonterminal which is created during transformation of
the following constructions. Remember that the naming is not
feature of standard POSIX YACC.
<p>
<item>Optional construction
<tscreen><verb>
nonterminal : ... [ pattern ] ...
</verb></tscreen>
is transformed into
<tscreen><verb>
nonterminal : ... N ...
N : pattern
N :
</verb></tscreen>
N denotes here a new nonterminal created during the
transformation. This construction is very convenient for
description of optional constructions. Remember that the
optional construction is not feature of standard POSIX YACC.
<p>
<item>Optional repetition
<tscreen><verb>
nonterminal : ... pattern * ...
</verb></tscreen>
is transformed into
<tscreen><verb>
nonterminal : ... N ...
N : N pattern
N :
</verb></tscreen>
N denotes here a new nonterminal created during the
transformation. This construction is very convenient for
description of zero or more the patterns. Remember that the
optional repetition is not feature of standard POSIX YACC.
<p>
<item>Repetition
<tscreen><verb>
nonterminal : ... pattern + ...
</verb></tscreen>
is transformed into
<tscreen><verb>
nonterminal : ... N ...
N : N pattern
N : pattern
</verb></tscreen>
N denotes here a new nonterminal created during the
transformation. This construction is very convenient for
description of one or more the patterns. Remember that the
repetition is not feature of standard POSIX YACC.
<p>
<item>Grouping
<tscreen><verb>
nonterminal : ... ( pattern ) ...
</verb></tscreen>
is transformed into
<tscreen><verb>
nonterminal : ... N ...
N : pattern
</verb></tscreen>
N denotes here a new nonterminal created during the
transformation. This construction is necessary to change
priority of the transformations. Remember that the grouping is
not feature of standard POSIX YACC.
<p>
<item>String
<tscreen><verb>
nonterminal : ... string ...
</verb></tscreen>
is transformed into
<tscreen><verb>
nonterminal : ... '1st char' '2nd char' ... 'last char' ...
</verb></tscreen>
Here the string is simply sequence of string characters as MSTA
literals. Remember that the strings are not standard feature
of POSIX YACC.
<p>
<item>Range
<tscreen><verb>
nonterminal : ... token1 - tokenN ...
</verb></tscreen>
is transformed into
<tscreen><verb>
nonterminal : N
N : token1
N : token2
...
N : tokenN
</verb></tscreen>
N denotes here a new nonterminal created during the
transformation. The range is simply any token with code
between code of token1 and code of token2 (inclusively). The
code of token1 must be less or equal to the code of token2.
Remember that the ranges are not feature of standard POSIX
YACC.
<p>
<item>Left open range
<tscreen><verb>
nonterminal : ... token1 <- tokenN ...
</verb></tscreen>
is transformed into
<tscreen><verb>
nonterminal : N
N : token2
N : token3
...
N : tokenN
</verb></tscreen>
N denotes here a new nonterminal created during the
transformation. The left open range is simply any token with
code between code of token1 + 1 and code of token2
(inclusively). The code of token1 must be less to the code of
token2. Remember that the ranges are not feature of standard
POSIX YACC.
<p>
<item>Right open range
<tscreen><verb>
nonterminal : ... token1 -> tokenN ...
</verb></tscreen>
is transformed into
<tscreen><verb>
nonterminal : N
N : token1
N : token2
...
N : tokenN-1
</verb></tscreen>
N denotes here a new nonterminal created during the
transformation. The right open range is simply any token with
code between code of token1 and code of token2 - 1
(inclusively). The code of token1 must be less to the code of
token2. Remember that the ranges are not feature of standard
POSIX YACC.
<p>
<item>Left right open range
<tscreen><verb>
nonterminal : ... token1 <-> tokenN ...
</verb></tscreen>
is transformed into
<tscreen><verb>
nonterminal : N
N : token2
N : token3
...
N : tokenN-1
</verb></tscreen>
N denotes here a new nonterminal created during the
transformation. The left right open range is simply any token
with code between code of token1 + 1 and code of token2 - 1
(inclusively). The code of token1 must be less to the code of
token2 - 1. Remember that the ranges are not feature of
standard POSIX YACC.
<p>
<item>Action inside pattern
<tscreen><verb>
nonterminal : ... action something non empty
</verb></tscreen>
is transformed into
<tscreen><verb>
nonterminal : ... N something non empty
N : action
</verb></tscreen>
N denotes here a new nonterminal created during the
transformation. The action is a C/C++ block.
</enum>
After the all possible transformations mentioned above, the rules will
contain sequence of only tokens (literals or token identifiers) and
nonterminals finishing optional %prec or/and %la construction or/and
an action.
The action is an arbitrary C/C++ block, i.e. declarations and
statements enclosed in curly braces { and }. Certain pseudo-variables
can be used in the action for attribute references. These are changed
by data structures known internally to MSTA. The pseudo-variables have
the following forms:
<descrip>
<tag>$$</tag>
This pseudo-variable denotes the nonterminal in the left
hand side of the simple rule.
<tag>$number</tag>
This pseudo-variable refers to the attribute of sequence
element (nonterminal, token, or action) specified by its
number in the right side of the rule before changing actions
inside pattern (see transformation above), reading from left
to right. The number can be zero or negative. If it is, it
refers to the attribute of the symbol (token or nonterminal)
on the parser's stack preceding the leftmost symbol of the
rule. (That is, $0 refers to the attribute of the symbol
immediately preceding the leftmost symbol in the rule, to be
found on the parser's stack, and $-1 refers to the symbol to
its left.) If number refers to an element past the current
point in the rule (i.e. past the action), or beyond the
bottom of the stack, the result is undefined.
<tag>$identifier</tag>
These pseudo-variable is analogous to the previous one but
the attribute name is used instead of its number. Of course
the attribute naming must exist.
<tag>$<...>number</tag>
This pseudo-variable is used when there are attributes of
different types in the grammar and the number corresponds to
the nonterminal whose type is not known because the
nonterminal has been generated during the transformation of
rules into the simple rules. The type name of the attribute
is placed into angle braces.
<tag>$<...>identifier</tag>
These pseudo-variable is analogous to the previous one but
the attribute name is used instead of its number. Of course
the attribute naming must exist.
<tag>$<...>$</tag>
This pseudo-variable is used when there are attributes of
different types in the grammar and the type of nonterminal
is not known because the nonterminal has been generated
during the transformation of rules into the simple rules.
</descrip>
Messages about some shift/reduce conflicts are not generated if the
rules in the conflict has priority and associativity. The priority
and associativity of rule are simply the precedence level and
associativity of the last token in the rule with declared precedence
level and associativity.
The optional construction `%prec ...' can be used to change the
precedence level associated with a particular simple rule. Examples
of this are in cases where a unary and binary operator have the same
symbolic representation, but need to be given different precedences.
The reserved keyword `%prec' can be followed by a token identifier or
a literal. It shall cause the precedence of the grammar rule to
become that of the following token identifier or literal.
The optional construction `%la number' can be used to change the
maximal look ahead associated with a particular simple rule. Example
of this is when there is a classical conflict if-then-else which is to
be resolved correctly with look ahead equal to 1 and there is a rule
with conflict which must be resolved with look ahead equal to 3. In
this case you can call MSTA with maximal look ahead equal to 1 (this
is default) and place %la 3 in the rule which takes part in the
conflict which must be resolved with look ahead equal to 3.
If a program section follows, the grammar rules shall be terminated by
%%.
<sect>Generated code
<p>
A specification as described in the previous section is translated by
MSTA into optional interface and
implementation files having the same names as one of specification
file and correspondingly suffixes `.h' and `.c' (C code) or `.cpp'
(C++ code). By default the interface file is not generated.
<sect1>C code
<p>
The interface and implementation files consist of the following
definitions of generated macros, types, and functions (unless special
information for MSTA scanner is mentioned, MSTA scanner object have
the same sense and names with additional s or S after the prefix `yy'
or `YY'):
<descrip>
<tag>YYSTYPE</tag>
By default this is macro. The macro value is type used for
representing the parser attributes. By default this macro is
defined as `int'. You can redefine the macro if you place
definition of the macro before standard definition of the
macro.
<p>
If construction `%union' is present in the specification
file, YYSTYPE is type definition of union with the code
written inside construction `%union'.
<p>
The definition of YYSTYPE is placed in the interface file if
option `-d' is on MSTA command line. Otherwise the
definition will be in the implementation file. YYSTYPE is a
part of YACC POSIX standard.
<tag>yychar</tag>
This variable contains code of the current token. The
current token is not latest read token because MSTA can look
ahead far. The codes are returned by scanner function
`yylex'. If option `-d' is present on the command line (see
MSTA usage), external definition of the variable is also
placed in the interface file. The variable is a part of YACC
POSIX standard.
<tag>yylval</tag>
This variable is used to exchange information of the parser
with a scanner. The scanner must return attribute of the
latest read token in this variable. After that the variable
contains attribute of the current token, i.e. whose code is
in the variable `yychar'. The variable `yylval' is declared
of type YYSTYPE. If option `-d' is present on the command
line (see MSTA usage), external definition of the variable is
also placed in the interface file. The variable is a part of
YACC POSIX standard.
<tag>YYDEBUG</tag>
The parser generated by MSTA has code for diagnostics. The
compilation of the runtime debugging code is under the
control of YYDEBUG, a preprocessor symbol. If YYDEBUG has a
nonzero value, the debugging code will be included. If its
value is zero, the code will not be included. The macro is a
part of YACC POSIX standard.
<tag>yydebug</tag>
In parser where the debugging code has been included (see
macro YYDEBUG), the variable `yydebug' can be used to turn
debugging on (with a nonzero value) and off (zero value) at
run time. The initial value of yydebug is zero. If option
`-d' is present on the command line (see MSTA usage),
external definition of the variable is also placed in the
interface file. The variable is a part of YACC POSIX
standard.
<tag>int yyparse ()</tag>
This function is main function of MSTA parser. The function
makes parsing of the token sequence whose codes are returned
by user-defined function `yylex' and whose attributes if any
are placed in variable `yylval'. The function returns 0 if
the parser successfully finished work. Nonzero returned status
means that the parser found unrecoverable errors (or macro
YYABORT was executed explicitly). This function is a part of
YACC POSIX standard.
<p>
This function has name `yylex' for MSTA scanner. The
function makes scanning of the character (token in
terminology of MSTA specification file) sequence whose codes
are returned by function `yyslex' and whose attributes if
any are placed in variable `yyslval'. The function returns 0
if the parser successfully finished work and reach end of
input file stream. Negative returned status means that the
parser found unrecoverable errors (or macro YYSABORT was
executed explicitly). This function can be called many times
for getting next token. Code of the next token is suggested
to returned by statements `return' in the actions. Input
stream (look ahead characters) is saved from a call of
`yylex' to the next its call.
<tag>int yylex ()</tag>
This function is an external function to the MSTA parser.
User must provide it. Each call of the function should
return code of the next input token. If end of input is
reached, the function should return zero (-1 for `yyslex).
Attribute of token whose code returned by the function should
be returned by the function through variable `yylval'. In
the case of MSTA scanner, function `yyparse' has name
`yylex'.
<tag>void yylex_start (int *error_code)</tag>
The function `yylex_start' is generated only for MSTA
scanner. The function should be used for initiation of the
scanner. Nonzero value returned through the parameter means
that ther was error in memory allocation for the scanner
(this is a fatal error). The function is not a part of YACC
POSIX standard.
<tag>yyprev_char</tag>
Its value is the latest shifted token (character) code.
Usually the value is used for forming internal representation
of tokens (e.g. identifier internal representation or number
value). The variable is not a part of YACC POSIX standard.
<tag>YYACCEPT</tag>
The macro YYACCEPT will cause the parser to return with the
value zero. This means normal parser work finish. The macro
is a part of YACC POSIX standard.
<tag>YYABORT</tag>
The macro YYABORT will cause the parser to return with a
nonzero value (1 for MSTA parser and -1 for macro YYSABORT
MSTA scanner). This means abnormal parser work finish. The
macro is a part of YACC POSIX standard.
<tag>yyerror</tag>
When the parser detects a syntax error in its normal state,
it normally calls external function yyerror with string
argument whose value is defined by macro YYERROR_MESSAGE.
User must provide function `yyerror' for building parser
program. After that the parser jumps to recovery mode. The
parser is considered to be recovering from a previous error
until the parser has shifted over at least
YYERR_RECOVERY_MATCHES normal input tokens since the last
error was detected or a semantic action has executed the
macro `yyerrok'. The function is a part of YACC POSIX
standard.
<p>
Recovery mode consists of on or more steps. Each recovery
step starts with searching for the uppest stack state on
which the shift on special symbol `error' is possible. This
state becomes the top stack state, and shift on `error' is
made. After that the parser discards all tokens which can
not be after the symbol `error' in this state (so called stop
symbols). After that any recognized syntatic error results
in the new error recovery step. This is technique of standard
YACC error recovery. Such technique may result in infinite
looping of the parser or discarding all input tokens if the
stop symbols are not met.
<p>
By default MSTA generates the standard YACC error recovery.
There are two additional methods which msta can generate for
error recovery.
<p>
The first one is a local error recovery which does not permit
infinite parser looping and use context after several error
as stop symbols. According this method look ahead set also
includes look ahead tokens after token `error' in states
which have the `error' token is acceptable and which are
lower in the parser stack than the first state with
acceptable token `error'. In this case the feedback from the
parser to the scanner could not work correctly because
although rule actions are executed in such case the parser
reads the tokens once.
<p>
The second one is a minimal cost error recovery where the
cost is overall number of tokens ignored. The feedback from
the parser to the scanner does not work correctly. So you
shouldn't use this method when there is the feedback.
Calling `yyerrok' has no sense for such method because the
parser in such recovery mode never executes the rule actions.
This method is the best quality error recovery although it my
be expensive method because in the worst case it might save
all input tokens.
<tag>YYERROR_MESSAGE</tag>
The macro value is used as a parameter of function yyerror
when a syntax error occurs. The default value of macro is
"syntax error" ("lexical error" for a scanner). You can
redefine its value. But in any case the value should be a
string. The macro is not a part of YACC POSIX standard.
<tag>YYERR_RECOVERY_MATCHES</tag>
The parser is considered to be recovering from a previous
error until the parser has shifted over at least
YYSERR_RECOVERY_MATCHES normal input tokens since the last
error was detected or a semantic action has executed the
macro `yyerrok'. The default value of macro is 3. You can
redefine its value. But in any case the value will be
positive. This macro is not a part of YACC POSIX standard.
<tag>YYERR_MAX_LOOK_AHEAD_CHARS</tag>
This macro is generated only when the local recovery mode is
used. The default value is 7. This value can not be less 1.
See description below.
<tag>YYERR_LOOK_AHEAD_INCREMENT</tag>
This macro is generated only when the local error recovery
mode is used. The default value is 3. This value can not be
less 0. See description below.
<tag>YYERR_POPPED_ERROR_STATES</tag>
This macro is generated only when the local error recovery
mode is used. The default value is 2. This value can not be
less 0. See description below.
<tag>YYERR_DISCARDED_CHARS</tag>
This macro is generated only when the local error recovery
mode is used. The default value is 3. This value can not be
less 0. See description below.
<tag>yydeeper_error_try</tag>
This and the previous macros (YYERR_MAX_LOOK_AHEAD_CHARS -
YYERR_DISCARDED_CHARS) are used only when the local error
recovery is generated. Before starting description of the
local error recovery, let me remind how YACC error recovery
works. When the parser recognizes a syntactic error, it
switches into error recovery mode. Error recovery itself
consists of one or more steps. Each step consists of finding
the top state on the stack with possible shift on
pseudo-token `error', throwing all states upper the state
with `error', and making shift on the `error' token. After
that all token are discarded until token (so called stop
symbol) which can be after the pseudo-token `error' is read.
After that any recognized error results in the local error
recovery step. And finally the error recovery is switched
off only when YYERR_RECOVERY_MATCHES (by default 3) tokens
are shifted without occurring syntactic error.
<p>
The differences of the local error recovery from classic YACC error
recovery is in the following:
<itemize>
<item>The parser saves all discarded tokens in error
recovery mode and returns them back into the input stream
on the local error recovery step.
<item>Only YYERR_LOOK_AHEAD_INCREMENT tokens can be
discarded on the first step, 2 * YYERR_LOOK_AHEAD_INCREMENT
on the second step and so on (but no more
YYERR_MAX_LOOK_AHEAD_CHARS tokens).
<item>If the parser requires discarding more tokens which is
possible on the step, the local error recovery step starts.
Moreover if action `yydeeper_error_try' has been fulfilled
on the previous step, the new step starts searching for the
error state on the stack with the state which is deeper
than the error state on the previous error recovery step.
Otherwise, as usually searching for the error state starts
with the top of the stack.
<item>On each YYERR_POPPED_ERROR_STATES error recovery step
(and correspondingly on each YYERR_POPPED_ERROR_STATES
processing the error state), the parser discards
YYERR_DISCARDED_CHARS tokens without saving them before
searching for the stop symbols.
</itemize>
By default MSTA generates YACC error recovery which does not
permit infinite parser looping and use context after several
error as stop symbols. The following fragment illustrates
usage of the local error recovery mode.
<tscreen><verb>
#define YYERR_END_RECOVERY() yyerr_end_recovery()
...
program :
| program function
...
function : ...
| error END FUNCTION
{yyerror ("error in function");}
...
statement : ...
| error
{
yyerror ("error in statement");
...
}
...
expression : ...
| error
{
yyerror ("error in expression");
...
}
...
yyerror (char *s)
{
/* save string s */
}
yyerr_end_recovery ()
{
/* print last saved error message. */
}
</verb></tscreen>
Note that action for error rule for function does not use
macro `yydeeper_error_try', this is warranty that the all
program will be processed.
<tag>YYRECOVERING()</tag>
The macro YYRECOVERING serves to determine in which state
the parser works now. The macro returns 1 if a syntax error
has been detected and the parser has not yet fully recovered
from it. Otherwise, zero is returned. The macro is a
part of YACC POSIX standard.
<tag>YYERROR</tag>
The parser detects a syntax error when it is in a state where
the action associated with the lookahead symbol(s) is error.
A semantic action can cause the parser to initiate error
handling by executing the macro YYERROR. When YYERROR is
executed, the semantic action passes control back to the
parser. YYERROR can be placed only in the semantic action
itself (not in a function called from the semantic action).
The single difference between error detected in the parser
input and error caused by macro YYERROR is that the function
`yyerror' is not called in the second case. The macro is a
part of YACC POSIX standard.
<tag>yynerrs</tag>
Actually this variable contains the number of switching the
parser state from normal to error recovery. This switching
is performed by fixing error in the input or by executing
macro YYERROR. The macro is not a part of YACC POSIX
standard. In the case of MSTA scanner, the variable
accumulates the number for all calls of `yylex'.
<tag>yyerrok</tag>
This macro can be used only in a semantic action itself. The
macro causes the parser to act as if it has fully recovered
from any previous errors. The macro is a part of YACC POSIX
standard. The macro has no sense for minimal error recovery
method because the parser in such recovery mode never
executes the rule actions.
<tag>YYERR_END_RECOVERY()</tag>
This macro is called when the parser switches from the
recovery state into normal state. By default the macro does
nothing. You can redefine this macro, e.g. to output the
last error buffered by your `yyerror' function in order to
implement better error diagnostics of the parser in the local
recovery mode. The macro is not a part of YACC POSIX
standard and the macro is not generated when yacc error
recovery is used.
<tag>YYERRCODE</tag>
The token error is reserved for error handling. The name
error can be used in grammar rules. It indicates places
where the parser can recover from a syntax error. The
default value of error shall be 256. Its value can be
changed using a %token declaration. In any case the code of
token error is value of macro YYERRCODE.
<tag>yyclearin</tag>
This macro cause the parser to discard the current lookahead
token. If the current lookahead token has not yet been read,
yyclearin has no effect. The macro is a part of YACC POSIX
standard.
<tag>YYALLOC, YYREALLOC, YYFREE</tag>
MSTA uses memory allocation for the state and attribute
stacks. Moreover, stacks can be expandable (with the aid of
YYREALLOC). The macro values are used for the stack memory
allocation/reallocation/freeing. Default value of the macros
are standard C functions malloc, realloc, free. You can
redefine this value. The macros are not a part of YACC POSIX
standard.
<tag>YYSTACK_SIZE</tag>
The macro value is initial size of state and attribute stacks
of the parser. If a stack become overfull, macro YYABORT is
executed when option -no-expand is used. Otherwise, the
stacks are expanded. It is better to use left recursion in
grammar rules in order to do not make overfull stacks.
Default value of the macro is 500. You can redefine this
value. The macro is not a part of YACC POSIX standard.
<tag>YYMAX_STACK_SIZE</tag>
The macro value is maximal size of state and attribute stacks
of the parser. The macro is used when the stacks are
expandable. If a stack size become bigger (may be after
several stacks expansions), macro YYABORT is executed.
Otherwise, the stacks are expanded. Default value of the
macro is 5000. You can redefine this value. The macro is
not a part of YACC POSIX standard.
<tag>YYMAX_STACK_EXPAND_SIZE</tag>
The macro value is step of state and attribute stacks
expansion. The macro is used only when the stacks are
expandable. Default value of the macro is 100. You can
redefine this value. The macro is not a part of YACC POSIX
standard.
<tag>YYMSTA</tag>
This macro defined as 1 is generated in order to differ the
parser generated by YACC, BISON, or MSTA. Naturally the
macro is not a part of YACC POSIX standard.
<tag>YYTOKEN_NAME(code)</tag>
This macro returns printable representation of token with
given code. The macro is not a part of YACC POSIX
standard.
<tag>YYLAST_TOKEN_CODE</tag>
This macro value is maximal code of tokens. the
macro is not a part of YACC POSIX standard.
</descrip>
<sect1>C++ code
<p>
The major advantage of C++ code is that it is quite easy to create
many parsers of one language (and consequently reenterable parser).
This is useful for implementation of module languages and languages
with macro directives of type of C include directive.
Generated C++ code is different from C code in the following features:
<itemize>
<item>Abstract class `yyparser' is generated for a parser and
`yyscanner' for a scanner. The definition of class will be
present also in interface file if the interface file is
generated (see MSTA usage).
<item>Variables `yylval' (`yyslval'), `yychar' (`yyschar'), and
`yydebug' (`yysdebug') are now public members of the class.
<item>Functions `yylex' (`yyslex'), `yyerror' (`yyserror') are
now abstract public virtual functions of the class.
<item>Functions `yyparse' for a parser and `yylex' for a
scanner are now public functions of the class.
<item>Function `yylex_start' for a scanner is not generated
because constructor of the class
<tscreen><verb>
yyscanner (int &)
</verb></tscreen>
replaces the function.
<item>The class contains also virtual destructor.
</itemize>
Usually the parser (scanner) itself is implemented as sub-class of
class `yyparser' (`yyscanner'). This subclass contains definition of
functions `yylex' (`yyslex') and `yyerror' (`yyserror').
<sect>Implementation
<p>
The following figure shows what major algorithms MSTA can uses to
generate the parsers:
<tscreen><verb>
1. Building -----> 4. LALR-optimization -----> 5. regular-optimization
LR(k)-sets ^ |
| |
| |
2. Building LALR(k)-sets-------->------------| V
by generalized | 6. equivalent tokens
DeRemer algorithm |
|
|
3. Building LALR(k)-sets------->-------------