'taakku' — all analyses lost in grammar checker #5
Our own pipe's kal-tokenise moves such non-baseform prefixes, yielding |
I hadn't realised that the CG reading syntax requires it to start with |
The analysis in xerox format is:
$ echo taakku | hfst-lookup src/fst/analyser-gt-desc.hfstol
taakku TA+una+Gram/Dem+Pron+Abs+Pl 0,000000
taakku TA+una+Gram/Dem+Pron+Rel+Pl 0,000000
I can guess that hfst-tokenise needs to do some guesswork to find out which parts of this analysis are lemma and which are tags, based on common practices that don't include this kind of combination. I'm not sure what the correct lemma/tags are here either. |
The assumption for
Placement of string in the analysis is not considered in the Accented chars in Unicode using combining diacritics are always automatically converted to a sequence of symbols (in the FST sense), both to support the criteria above and to make parsing of input text simple and straightforward. |
We have kal-generate to move them back and turn CG into FST for generation. Greenlandic only has 2 prefixes, A decade ago it used to be analyzed as CG has always had somewhat strict stream format. CG-2 required first tag to be either |
I see that both kal-tokenise and kal-generate are Perl scripts. That is not very portable for standalone grammar checkers. I understand that there is more to these scripts than just moving prefix tags, but would it be OK to replace that part of the scripts with some simple (Rust/C/whatever) code that moves prefix tags back and forth as needed, and leave the rest for now? In the end I would like to have all the functionality of the Perl scripts encoded in one of FST/CG/compiled binary, but I suggest we start with the prefix tags and see how that works. |
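For illustration, here is a minimal sketch of the forward half of that prefix-moving step, written in Python for brevity rather than Rust/C. It is not what kal-tokenise actually does. It assumes the analysis is a plain "+"-separated string as in the hfst-lookup output above, that the lemma contains no "+", and that the set of prefix symbols is known in advance. Only TA appears in this thread, so the default set and the function name `xerox_to_cg` are assumptions.

```python
def xerox_to_cg(analysis, prefixes=frozenset({"TA"})):
    """Turn 'TA+una+Gram/Dem+Pron+Abs+Pl' into '"una" Prefix/TA Gram/Dem Pron Abs Pl'.

    Assumptions (not confirmed by the real scripts): parts are separated by
    '+', the lemma itself contains no '+', and `prefixes` lists every symbol
    that may precede the lemma.
    """
    parts = analysis.split("+")
    # Count leading parts that are known prefixes; always leave at least
    # one part to serve as the lemma.
    i = 0
    while i < len(parts) - 1 and parts[i] in prefixes:
        i += 1
    lemma = parts[i]
    moved = ["Prefix/" + p for p in parts[:i]]  # original order is kept
    tags = parts[i + 1:]
    return " ".join(['"' + lemma + '"'] + moved + tags)


print(xerox_to_cg("TA+una+Gram/Dem+Pron+Abs+Pl"))
```

Analyses without a prefix pass through with only the lemma quoted, so the same function could be applied to every reading.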
Sure. There are other things we need in the final pipe, though, such as the https://github.com/Oqaasileriffik/katersat semantic tags module. I would prefer to get |
What is the semantic tag module, and how is it used? |
I think I implemented most of it at on the way some time ago but it needs testing (and development) since it's the least portable and standardised code. |
Almost all of our semantic tags come from our online dictionary interface, Katersat, and not the FST. Katersat is easier for everyone to work with. Student helpers and others can easily be taught how to tag semantics, provide translations, and explanations in Katersat, without needing to know how to change the FST. It's also vastly faster during development. We then extract those semantic tags from Katersat and apply them to the output of the FST, so the pipe can make use of them for disambiguation. |
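As a rough sketch of what "apply them to the output of the FST" could look like on a CG stream: the function below appends tags from a lookup table to each reading line. Everything specific here is an assumption made for illustration: the table being keyed on the lemma alone, the tags being appended at the end of the reading, the function name `apply_semantic_tags`, and the placeholder tag `Sem/Hypothetical`. The real katersat module may work quite differently.

```python
def apply_semantic_tags(cg_stream, lookup):
    """Append semantic tags to CG readings whose lemma is in `lookup`.

    `lookup` maps a lemma to a list of tags. Keying on the lemma alone is an
    assumption; the real Katersat export may use a richer key.
    """
    out = []
    for line in cg_stream.splitlines():
        stripped = line.lstrip()
        # Reading lines are indented and start with a quoted lemma;
        # cohort lines ("<wordform>") are not indented.
        if line[:1].isspace() and stripped.startswith('"'):
            lemma = stripped[1:].split('"', 1)[0]
            extra = lookup.get(lemma, [])
            if extra:
                line = line.rstrip() + " " + " ".join(extra)
        out.append(line)
    return "\n".join(out)


stream = '"<marluk>"\n\t"marluk" Num Abs Pl'
# "Sem/Hypothetical" is a made-up placeholder, not a real Katersat tag.
print(apply_semantic_tags(stream, {"marluk": ["Sem/Hypothetical"]}))
```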
This is what I get on current libdivvun:
$ echo "taakku marluk inuunerminni taama pilluartigisimanngisaannarput" | modes/kalgram-full.mode
dependency.cg3: Warning: Barriers only make sense for scanning or self tests on line 11238 at 1A (/"piareer"\ Gram/IV\ SAR\ Der/vv\ Gram/TV\ Gram/Refl\ V/l) + VFIN BARRIER KOMMA OR VFIN.
functions.cg3: Warning: Barriers only make sense for scanning or self tests on line 8501 at 1 (/"piareer"\ Gram/IV\ SAR\ Der/vv\ Gram/TV\ Gram/Refl\ V/l) + VFIN BARRIER KOMMA OR VFIN.
"<taakku>"
"una" Prefix/TA Gram/Dem Pron Abs Pl <W:0.0> @Pron>N #1->2
:
"<marluk>"
"marluk" Num Abs Pl <W:0.0> @SUBJ> #2->5
:
"<inuunerminni>"
"inuk" U Der/nv Gram/IV NIQ Der/vn N Lok Sg 4PlPoss <W:0.0> @ADVL> #3->5
:
"<taama>"
"taama" Adv <W:0.0> @>V #4->5
:
"<pilluartigisimanngisaannarput>"
"pilluar" Gram/IV TIGE Der/vv Gram/IV SIMA Der/vv NNGISAANNAR Der/vv Gram/IV V Ind 3Pl <W:0.0> @PRED #5->0
:
I hacked the mode file to point the |
Yep, that's as it should be. But the rest of libdivvun needs to handle
Which seems broken, as the unprefixed word |
Ah, I think I get it now: at least the grammar checker needs to restore the prefix tags to their original places before generating. The speller component probably does something with the tags too; I cannot remember exactly how it works. |
I have now made the cgspell part move all tags that come before the lemma to the other side of it, marked with Prefix/. |
In what order? Same as before throw, or reversed, or undefined? |
Should come in the same order for now. I guess up until the grammar correction generator it's just handled by VISL CG 3 rules so it's ordering agnostic. |
CG is order agnostic, but the generator is not, so we need the order to be fixed. As long as we agree what that fixed order is there is no problem 🙂 Same order seems fine and logical. |
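To make the agreed "same order" convention concrete, here is a small sketch of the reverse step: taking a CG reading and rebuilding a "+"-separated generator input with the Prefix/ tags restored in front of the lemma, in the order they appear in the reading. This is not the kal-generate or libdivvun implementation. The name `cg_to_xerox` is hypothetical, and dropping weights (`<W:...>`), syntactic functions (`@...`) and dependency relations (`#...`) before generation is an assumption.

```python
import re

# A reading is a quoted lemma followed by space-separated tags.
READING = re.compile(r'^\s*"([^"]*)"\s*(.*)$')


def cg_to_xerox(reading):
    """Rebuild 'TA+una+Gram/Dem+Pron+Abs+Pl' from a CG reading line.

    Prefix/ tags go back in front of the lemma in the order they occur in
    the reading. Tags starting with '<', '@' or '#' are dropped, on the
    assumption that the generator does not accept them.
    """
    m = READING.match(reading)
    if m is None:
        raise ValueError("not a CG reading: %r" % reading)
    lemma, rest = m.group(1), m.group(2).split()
    marker = "Prefix/"
    prefixes = [t[len(marker):] for t in rest if t.startswith(marker)]
    tags = [t for t in rest
            if not t.startswith(marker) and t[0] not in "<@#"]
    return "+".join(prefixes + [lemma] + tags)


print(cg_to_xerox('"una" Prefix/TA Gram/Dem Pron Abs Pl <W:0.0> @Pron>N #1->2'))
```

Because the prefixes are collected in reading order, CG rules can reorder other tags freely as long as the Prefix/ tags keep their relative order.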
Is there an example of a prefixed-word generation issue for the grammar checker part? The spell-checker should work now. |
Seems to be working: "" This was before: |
Cf this:
In this first step everything is correct. But in the next step something strange is happening:
The word form has been moved to after the analyser output. I have no idea why and how. In any case, this leads to the analyses getting lost later, leaving the bare word form:
Any idea, @TinoDidriksen @unhammer @flammie ?