'taakku' — all analyses lost in grammar checker #5

Open
snomos opened this issue Dec 18, 2024 · 20 comments
Labels
bug Something isn't working

Comments

@snomos
Member

snomos commented Dec 18, 2024

Cf this:

echo "taakku marluk inuunerminni taama pilluartigisimanngisaannarput" | modes/trace-kalgram0-morph.mode
"<taakku>"
    TA "una" Gram/Dem Pron Abs Pl <W:0.0>
    TA "una" Gram/Dem Pron Rel Pl <W:0.0>
:

In this first step everything is correct. But in the next step something strange is happening:

echo "taakku marluk inuunerminni taama pilluartigisimanngisaannarput" | modes/trace-kalgram1-blanktag.mode
    TA "una" Gram/Dem Pron Abs Pl <W:0.0>
    TA "una" Gram/Dem Pron Rel Pl <W:0.0>
"<taakku>"
:

The word form has been moved to after the analyser output. I have no idea why and how. In any case, this leads to the analyses getting lost later, leaving the bare word form:

echo "taakku marluk inuunerminni taama pilluartigisimanngisaannarput" | modes/trace-kalgram.mode 
"<taakku>"
: 
"<marluk>"
	"marluk" Num Abs Pl <W:0.0>
;	"marluk" Num Rel Pl <W:0.0> REMOVE:2385:tidlig0020A
;	"marluk" Orth/Alt N Abs Sg <W:0.0> REMOVE:2189:0001P
:

Any idea, @TinoDidriksen @unhammer @flammie ?

snomos added the bug label Dec 18, 2024
@TinoDidriksen
Member

TA "una" Gram/Dem Pron Abs Pl <W:0.0> is not a valid CG-reading, as it doesn't start with ". Thus it is treated as text and moved out of the cohort it was in.

Our own pipe's kal-tokenise moves such non-baseform prefixes, yielding "una" Prefix/TA Gram/Dem Pron Abs Pl <W:0.0>. And it performs other needed corrections.
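
A minimal sketch of that prefix move, for illustration only. It is not the actual kal-tokenise code (which is a Perl script and performs further corrections); the function name and the regex are assumptions, and baseforms containing escaped quotes are not handled.

```python
import re

# A quoted baseform such as "una". Escaped quotes inside a baseform are
# not handled in this sketch.
BASEFORM = re.compile(r'"[^"]+"')


def move_prefixes_after_baseform(line: str) -> str:
    """Rewrite `TA "una" Gram/Dem ...` as `"una" Prefix/TA Gram/Dem ...`.

    Hypothetical helper: tokens found before the baseform are treated as
    prefix tags and re-emitted after it, in their original order.
    """
    indent = line[: len(line) - len(line.lstrip())]
    body = line.strip()
    # Leave alone: empty lines, wordforms and valid readings (start with "),
    # removed readings (start with ;) and blank/space lines (start with :).
    if not body or body[0] in '";:':
        return line
    match = BASEFORM.search(body)
    if match is None:
        return line  # no baseform at all, so this really is plain text
    prefixes = body[: match.start()].split()
    rest = body[match.end():].strip()
    moved = " ".join(f"Prefix/{p}" for p in prefixes)
    parts = [match.group(0), moved, rest]
    return indent + " ".join(p for p in parts if p)


print(move_prefixes_after_baseform('    TA "una" Gram/Dem Pron Abs Pl <W:0.0>'))
```

Applied to the first reading from the report, this yields a line that starts with `"` and is therefore kept inside its cohort.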

@snomos
Member Author

snomos commented Dec 18, 2024

I hadn't realised that CG reading syntax requires a reading to start with ". That is quite problematic when the input is FST analyses of prefix-heavy languages. What do you do then? In a grammar checker, tag order matters, because it must be retained for word form generation at the end of processing.

@flammie
Contributor

flammie commented Dec 18, 2024

the analysis in xerox format is:

$ echo taakku | hfst-lookup src/fst/analyser-gt-desc.hfstol 
taakku	TA+una+Gram/Dem+Pron+Abs+Pl	0,000000
taakku	TA+una+Gram/Dem+Pron+Rel+Pl	0,000000

I can guess that hfst-tokenise needs to do some guesswork to work out which parts of this analysis are lemma and which are tags, based on common practices that don't include this kind of combination. I'm not sure what the correct lemma and tags are here either.

@snomos
Member Author

snomos commented Dec 18, 2024

The assumption for hfst-tokenise is very simple, and automatically handled in the FST pipeline:

  • all multichar symbols are tags
  • sequences of non-multichars are strings/word forms
  • there should be one and only one such string per line in the analysis cohort

Placement of string in the analysis is not considered in the hfst-tokenise output, exactly because of prefixing languages.
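
The three criteria above can be sketched as follows. This is an illustration, not hfst-tokenise's implementation; the function name is hypothetical, and the exact spelling of the multichar symbols in the example (`TA+`, `+Gram/Dem`, ...) is an assumption.

```python
def split_analysis(symbols: list[str]) -> tuple[str, list[str]]:
    """Split a sequence of FST output symbols into (string, tags).

    Multichar symbols are tags; single-character symbols are collected into
    the string. Exactly one contiguous run of single characters is allowed,
    but its position among the tags does not matter.
    """
    string_parts: list[str] = []
    tags: list[str] = []
    runs = 0        # number of separate single-character runs seen so far
    in_run = False  # True while we are inside such a run
    for sym in symbols:
        if len(sym) > 1:
            tags.append(sym)
            in_run = False
        else:
            if not in_run:
                runs += 1
                in_run = True
            string_parts.append(sym)
    if runs != 1:
        raise ValueError(f"expected exactly one string, found {runs}")
    return "".join(string_parts), tags


# Assumed symbol sequence for the analysis TA+una+Gram/Dem+Pron+Abs+Pl.
string, tags = split_analysis(
    ["TA+", "u", "n", "a", "+Gram/Dem", "+Pron", "+Abs", "+Pl"]
)
print(string, tags)
```

Because only symbol length is inspected, a prefix tag before the string and suffix tags after it are treated alike, which is the point made about prefixing languages.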

Accented chars in Unicode using combining diacritics are always automatically converted to a sequence of symbols (in the FST sense), both to support the criteria above and to make parsing of input text simple and straightforward.

@TinoDidriksen
Member

We have kal-generate to move them back and turn CG into FST for generation. Greenlandic only has 2 prefixes, AA and TA, and they are definitely not the baseform.
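
The reverse direction can be sketched like this. It is not the actual kal-generate script; the function name is hypothetical, and the choice of which CG-side tokens to drop (`<...>`, `@...`, `#...`) is an assumption.

```python
def cg_reading_to_fst(reading: str) -> str:
    """Turn `"una" Prefix/TA Gram/Dem Pron Abs Pl <W:0.0>` back into
    `TA+una+Gram/Dem+Pron+Abs+Pl` for generation.

    Prefix/ tags are moved back in front of the baseform in their original
    order. Baseforms containing spaces are not handled in this sketch.
    """
    tokens = reading.split()
    if not tokens or not (tokens[0].startswith('"') and tokens[0].endswith('"')):
        raise ValueError("a CG reading must start with a quoted baseform")
    lemma = tokens[0][1:-1]
    prefixes: list[str] = []
    tags: list[str] = []
    for tok in tokens[1:]:
        if tok.startswith("Prefix/"):
            prefixes.append(tok[len("Prefix/"):])
        elif tok.startswith("<") and tok.endswith(">"):
            continue  # annotations such as <W:0.0>, assumed not generator input
        elif tok.startswith("@") or tok.startswith("#"):
            continue  # syntactic function and dependency tags
        else:
            tags.append(tok)
    return "+".join(prefixes + [lemma] + tags)


print(cg_reading_to_fst('"una" Prefix/TA Gram/Dem Pron Abs Pl <W:0.0> @Pron>N #1->2'))
```

Tags added after analysis, such as semantic tags, would presumably also need filtering before generation; that is not covered here.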

A decade ago it used to be analyzed as "TA" una Gram/Dem Pron Abs Pl because we simply defined the first tag as baseform, but this caused other issues because the actual baseform was left as a tag. hfst-tokenise fixed that issue and we could very easily work around prefixes.

CG has always had a somewhat strict stream format. CG-2 required the first tag to be either [baseform] or "baseform". CG-3 changed this to only support "baseform", in order to allow more mixed content.

@snomos
Member Author

snomos commented Dec 18, 2024

I see that both kal-tokenise and kal-generate are perl scripts. That is not very portable for standalone grammar checkers. I understand that there is more to these scripts than just moving prefix tags, but would it be ok to replace that part of the scripts with some simple (Rust/C/whatever) code to move prefix tags back and forth as needed, and leave the rest for now?

In the end I would like to have all the functionality of the perl scripts encoded in one of FST/CG/compiled binary, but I suggest we start with the prefix tags and see how that works.

@TinoDidriksen
Member

Sure.

There are other things we need in the final pipe, though, such as the https://github.com/Oqaasileriffik/katersat semantic tags module. I would prefer to get <sh> implemented in libdivvun so we can test the actual pipe, before needing to port all the Perl and Python parts to C++. But I guess my yule project could be to port it all.

@snomos
Member Author

snomos commented Dec 18, 2024

What is the semantic tag module, and how is it used?

@flammie
Contributor

flammie commented Dec 19, 2024

> I would prefer to get <sh> implemented in libdivvun so we can test the actual pipe,

I think I implemented most of it along the way some time ago, but it needs testing (and development), since it's the least portable and least standardised code.

@TinoDidriksen
Member

> What is the semantic tag module, and how is it used?

Almost all of our semantic tags come from our online dictionary interface, Katersat, and not the FST. Katersat is easier for everyone to work with. Student helpers and others can easily be taught how to tag semantics and provide translations and explanations in Katersat, without needing to know how to change the FST. It's also vastly faster during development.

We then extract those semantic tags from Katersat and apply them to the output of the FST, so the pipe can make use of them for disambiguation.
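
A rough sketch of that post-processing step, under loud assumptions: the lookup table is a hypothetical stand-in for data extracted from Katersat, the function is not the katersat module's API, and the tags are simply placed right after the baseform. Output later in this thread shows the tag elsewhere in derived words, so the real placement rule is not reproduced here.

```python
# Hypothetical stand-in for an export from Katersat: baseform -> semantic tags.
SEM_TAGS = {"marluk": ["Sem/ac-sign"]}


def add_semantic_tags(reading: str, sem_tags: dict[str, list[str]]) -> str:
    """Insert semantic tags for the reading's baseform, if any are known.

    Works on a single reading without its indentation. Readings that do not
    start with a quoted baseform are returned unchanged.
    """
    tokens = reading.split()
    if not tokens or not tokens[0].startswith('"'):
        return reading
    lemma = tokens[0].strip('"')
    # Skip tags that are already present so the step can be re-run safely.
    extra = [t for t in sem_tags.get(lemma, []) if t not in tokens]
    return " ".join(tokens[:1] + extra + tokens[1:])


print(add_semantic_tags('"marluk" Num Abs Pl <W:0.0>', SEM_TAGS))
```

Keeping the tags in a separate table is what lets them be edited without touching the FST; the pipe only needs this lookup between analysis and disambiguation.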

@TinoDidriksen
Member

Added testing kalgram-full pipe (ping @Juutitta) in 32560d3 - it assumes our ~/langtech/ setup for now.

Still need to modify divvun-suggest to understand Prefix/* tags or prefixed tags, and port everything else to C++.

@flammie
Contributor

flammie commented Jan 20, 2025

This is what I get on current libdivvun:

$ echo "taakku marluk inuunerminni taama pilluartigisimanngisaannarput" | modes/kalgram-full.mode 
dependency.cg3: Warning: Barriers only make sense for scanning or self tests on line 11238 at 1A (/"piareer"\ Gram/IV\ SAR\ Der/vv\ Gram/TV\ Gram/Refl\ V/l) + VFIN BARRIER KOMMA OR VFIN.
functions.cg3: Warning: Barriers only make sense for scanning or self tests on line 8501 at 1 (/"piareer"\ Gram/IV\ SAR\ Der/vv\ Gram/TV\ Gram/Refl\ V/l) + VFIN BARRIER KOMMA OR VFIN.
"<taakku>"
	"una" Prefix/TA Gram/Dem Pron Abs Pl <W:0.0> @Pron>N #1->2
: 
"<marluk>"
	"marluk" Num Abs Pl <W:0.0> @SUBJ> #2->5
: 
"<inuunerminni>"
	"inuk" U Der/nv Gram/IV NIQ Der/vn N Lok Sg 4PlPoss <W:0.0> @ADVL> #3->5
: 
"<taama>"
	"taama" Adv <W:0.0> @>V #4->5
: 
"<pilluartigisimanngisaannarput>"
	"pilluar" Gram/IV TIGE Der/vv Gram/IV SIMA Der/vv NNGISAANNAR Der/vv Gram/IV V Ind 3Pl <W:0.0> @PRED #5->0
:\n

I hacked the mode file to point to kal-prefix-propagate correctly, though, and commented out the Python script I was missing. If it's relevant for this issue, I can check it out later.

@TinoDidriksen
Member

Yep, that's as it should be.

But the rest of libdivvun needs to handle Prefix/* and prefixed tags. E.g., a typo in a long prefixed word tamatumuunakkut => "manna" Prefix/TA Gram/Dem Sem/ac Pron Via Sg:

$ echo 'tamatumuunakkutt' | bash modes/kalgram-full.mode
"<tamatumuunakkutt>"
        "tamatumuunakkutt" ?

Which seems broken, as the unprefixed word matumuunakkut => "manna" Gram/Dem Pron Via Sg is handled fine when typoed to matumuunakkutt. So all CG parsers and generators need to move prefixes around. Luckily, that's simple enough to do.

@flammie
Contributor

flammie commented Jan 21, 2025

Ah, I think I get it now: at least the grammar checker needs to restore the prefix tags to their original places before generating. The speller component probably does something with the tags too; I cannot remember exactly how it works.

@flammie
Contributor

flammie commented Jan 21, 2025

I made cgspell part throw all tags before lemma to the other side with Prefix/ now.

@snomos
Member Author

snomos commented Jan 21, 2025

> I made cgspell part throw all tags before lemma to the other side with Prefix/ now.

In what order? Same as before throw, or reversed, or undefined?

@flammie
Contributor

flammie commented Jan 21, 2025

> > I made cgspell part throw all tags before lemma to the other side with Prefix/ now.
>
> In what order? Same as before throw, or reversed, or undefined?

Should come in the same order for now. I guess up until the grammar correction generator it's just handled by VISL CG 3 rules so it's ordering agnostic.

@snomos
Member Author

snomos commented Jan 21, 2025

CG is order agnostic, but the generator is not, so we need the order to be fixed. As long as we agree what that fixed order is there is no problem 🙂 Same order seems fine and logical.

@flammie
Contributor

flammie commented Jan 30, 2025

Is there an example of prefixed word generation issue for the grammar checker part? The spell-checker should work now.

@Juutitta
Contributor

> Is there an example of prefixed word generation issue for the grammar checker part? The spell-checker should work now.

Seems to be working:
tools/grammarcheckers - (main) > echo "taakku marluk inuunerminni taama pilluartigisimanngisaannarput" | modes/kalgram-full.mode

"<taakku>"
	"una" Prefix/TA Gram/Dem Pron Abs Pl <W:0.0> @pron>N #1->2
"<marluk>"
	"marluk" Sem/ac-sign Num Abs Pl <W:0.0> @subj> #2->5
"<inuunerminni>"
	"inuk" U Der/nv Gram/IV NIQ Der/vn Sem/ac N Lok Sg 4PlPoss <W:0.0> @advl> #3->5
"<taama>"
	"taama" Adv <W:0.0> @>V #4->5
"<pilluartigisimanngisaannarput>"
	"pilluar" Gram/IV iSem/emote TIGE Der/vv Gram/IV SIMA Der/vv NNGISAANNAR Der/vv Gram/IV V Ind 3Pl <f:be_attribute_jpsych> <§TH_@SUBJ_N_Abs> <§TH@SUBJ-NULL__N_Pron_Prop> <W:0.0> @pred #5->0
:\n

This was before:
tools/grammarcheckers - (main) > echo "taakku marluk inuunerminni taama pilluartigisimanngisaannarput" | modes/kalgram.mode
Warning: No soft or hard delimiters defined in grammar. Hard limit of 500 cohorts may break windows in unintended places.
Warning: No soft or hard delimiters defined in grammar. Hard limit of 500 cohorts may break windows in unintended places.
"<taakku>"
:
"<marluk>"
	"marluk" Num Abs Pl <W:0.0>
:
"<inuunerminni>"
	"inuk" U Der/nv Gram/IV NIQ Der/vn N Lok Sg 4PlPoss <W:0.0>
:
"<taama>"
	"taama" Adv <W:0.0>
:
"<pilluartigisimanngisaannarput>"
	"pilluar" Gram/IV TIGE Der/vv SIMA Der/vv NNGISAANNAR Der/vv Gram/IV V Ind 3Pl <W:0.0>
:\n
