'taakku' — all analyses lost in grammar checker #5

snomos · 2024-12-18T12:36:51Z

Cf this:

echo "taakku marluk inuunerminni taama pilluartigisimanngisaannarput" | modes/trace-kalgram0-morph.mode
"<taakku>"
    TA "una" Gram/Dem Pron Abs Pl <W:0.0>
    TA "una" Gram/Dem Pron Rel Pl <W:0.0>
:
…

In this first step everything is correct. But in the next step something strange is happening:

echo "taakku marluk inuunerminni taama pilluartigisimanngisaannarput" | modes/trace-kalgram1-blanktag.mode
    TA "una" Gram/Dem Pron Abs Pl <W:0.0>
    TA "una" Gram/Dem Pron Rel Pl <W:0.0>
"<taakku>"
:
…

The word form has been moved to after the analyser output. I have no idea why and how. In any case, this leads to the analyses getting lost later, leaving the bare word form:

echo "taakku marluk inuunerminni taama pilluartigisimanngisaannarput" | modes/trace-kalgram.mode 
"<taakku>"
: 
"<marluk>"
	"marluk" Num Abs Pl <W:0.0>
;	"marluk" Num Rel Pl <W:0.0> REMOVE:2385:tidlig0020A
;	"marluk" Orth/Alt N Abs Sg <W:0.0> REMOVE:2189:0001P
: 
…

Any idea, @TinoDidriksen @unhammer @flammie ?

TinoDidriksen · 2024-12-18T12:44:29Z

TA "una" Gram/Dem Pron Abs Pl <W:0.0> is not a valid CG-reading, as it doesn't start with ". Thus it is treated as text and moved out of the cohort it was in.

Our own pipe's kal-tokenise moves such non-baseform prefixes, yielding "una" Prefix/TA Gram/Dem Pron Abs Pl <W:0.0>. And it performs other needed corrections.
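
The prefix move described in this comment can be sketched in a few lines of Python. This is not the actual kal-tokenise code, which is a Perl script and performs other corrections as well. The function name `move_prefixes_to_tags` is hypothetical, and the sketch assumes that every token before the first quoted token in a reading is a prefix tag and that the line is passed without its trailing newline.

```python
import re

# Hypothetical helper, not the real kal-tokenise: rewrite one reading line so
# that it starts with the quoted baseform, as the CG-3 stream format requires.
# Groups: indentation, one or more unquoted tokens, the baseform, the rest.
_READING = re.compile(r'^(\s*)((?:[^\s"]\S*\s+)+)("[^"]*")(.*)$')

def move_prefixes_to_tags(line: str) -> str:
    """Turn 'TA "una" Gram/Dem ...' into '"una" Prefix/TA Gram/Dem ...'."""
    m = _READING.match(line)
    if m is None:
        # The line already starts with a baseform, or it is not a reading
        # (word form line, blank separator, plain text): leave it alone.
        return line
    indent, prefixes, baseform, rest = m.groups()
    # Keep the prefixes in their original order, marked as Prefix/ tags.
    moved = " ".join("Prefix/" + p for p in prefixes.split())
    return f"{indent}{baseform} {moved}{rest}"

print(move_prefixes_to_tags('\tTA "una" Gram/Dem Pron Abs Pl <W:0.0>'))
```

Word form lines such as "<taakku>" and readings that already start with a baseform pass through unchanged, so the function could be applied to every line of the stream.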

snomos · 2024-12-18T12:50:58Z

I hadn't realised that the CG reading syntax requires a reading to start with " - this is quite problematic when the input is FST analyses of prefix-heavy languages. What do you do then? In a grammar checker context tag order is important, as the tag order needs to be retained for word form generation at the end of the processing.

flammie · 2024-12-18T12:54:45Z

the analysis in xerox format is:

$ echo taakku | hfst-lookup src/fst/analyser-gt-desc.hfstol 
taakku	TA+una+Gram/Dem+Pron+Abs+Pl	0,000000
taakku	TA+una+Gram/Dem+Pron+Rel+Pl	0,000000

I can guess that hfst-tokenise needs to do some guesswork to work out which parts of this analysis are lemma and which are tags, based on common practices that don't include this kind of combination. I'm not sure what the correct lemma/tags are here either.

snomos · 2024-12-18T13:01:59Z

The assumption for hfst-tokenise is very simple, and automatically handled in the FST pipeline:

- all multichar symbols are tags
- sequences of non-multichars are strings/word forms
- there should be one and only one such string per line in the analysis cohort

Placement of string in the analysis is not considered in the hfst-tokenise output, exactly because of prefixing languages.

Accented chars in Unicode using combining diacritics are always automatically converted to a sequence of symbols (in the FST sense), both to support the criteria above and to make parsing of input text simple and straightforward.
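
As a rough illustration of the three rules listed above, here is a minimal Python sketch. It is not hfst-tokenise code. The function `split_analysis` is hypothetical, and so is the symbol segmentation in the example: the actual symbol inventory of the Greenlandic FST is not shown in this thread, so the `TA+` and `+Pron` style symbols are assumptions.

```python
def split_analysis(symbols: list[str]) -> tuple[str, list[str]]:
    """Split one analysis, given as FST symbols, into (string, tags).

    Multichar symbols are tags; a run of single-character symbols is the
    string; exactly one such run is allowed, wherever it is placed.
    """
    tags: list[str] = []
    runs: list[str] = []      # maximal runs of single-character symbols
    current: list[str] = []
    for sym in symbols:
        if len(sym) > 1:      # multichar symbol: a tag
            tags.append(sym)
            if current:       # a tag closes the run collected so far
                runs.append("".join(current))
                current = []
        else:
            current.append(sym)
    if current:
        runs.append("".join(current))
    if len(runs) != 1:
        raise ValueError(f"expected exactly one string, found {len(runs)}")
    return runs[0], tags

# Assumed segmentation of the 'taakku' analysis, for illustration only.
symbols = ["TA+", "u", "n", "a", "+Gram/Dem", "+Pron", "+Abs", "+Pl"]
print(split_analysis(symbols))
```

Because the string is found by symbol type rather than by position, a prefix tag in front of it does not change the result, which is the point made about prefixing languages.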

TinoDidriksen · 2024-12-18T13:03:00Z

We have kal-generate to move them back and turn CG into FST for generation. Greenlandic only has 2 prefixes, AA and TA, and they are definitely not the baseform.

A decade ago it used to be analyzed as "TA" una Gram/Dem Pron Abs Pl because we simply defined the first tag as baseform, but this caused other issues because the actual baseform was left as a tag. hfst-tokenise fixed that issue and we could very easily work around prefixes.

CG has always had somewhat strict stream format. CG-2 required first tag to be either [baseform] or "baseform". CG-3 changed this to only support "baseform", in order to allow more mixed content.
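
The reverse step attributed to kal-generate above, moving prefixes back and turning a CG reading into an FST analysis string, might look roughly like the following Python sketch. It is not the real kal-generate. The function name `cg_reading_to_fst` is hypothetical, and the choice of which tags to drop (anything starting with `<`, `@` or `#`) is an assumption. The sketch also assumes baseforms without spaces and does nothing about tags added later in the pipe, such as semantic tags.

```python
PREFIX = "Prefix/"

def cg_reading_to_fst(reading: str) -> str:
    """Turn '"una" Prefix/TA Gram/Dem ...' into 'TA+una+Gram/Dem+...'."""
    tokens = reading.split()
    baseform = tokens[0].strip('"')
    prefixes: list[str] = []
    tags: list[str] = []
    for tok in tokens[1:]:
        if tok.startswith(PREFIX):
            # Restore the prefix to its place before the baseform,
            # keeping the order in which the Prefix/ tags appear.
            prefixes.append(tok[len(PREFIX):])
        elif tok[0] in "<@#":
            # Assumption: weights, syntactic functions and dependency
            # relations are not part of the FST analysis string.
            continue
        else:
            tags.append(tok)
    return "+".join(prefixes + [baseform] + tags)

print(cg_reading_to_fst('"una" Prefix/TA Gram/Dem Pron Abs Pl <W:0.0>'))
```

For the 'taakku' reading this yields the same string as the first hfst-lookup analysis quoted earlier in the thread, TA+una+Gram/Dem+Pron+Abs+Pl.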

snomos · 2024-12-18T13:41:47Z

I see that both kal-tokenise and kal-generate are perl scripts. That is not very portable for standalone grammar checkers. I understand that there is more to these scripts than just moving prefix tags, but would it be ok to replace that part of the scripts with some simple (Rust/C/whatever) code to move prefix tags back and forth as needed, and leave the rest for now?

In the end I would like to have all the functionality of the perl scripts encoded in one of FST/CG/compiled binary, but I suggest we start with the prefix tags and see how that works.

TinoDidriksen · 2024-12-18T14:14:13Z

Sure.

There are other things we need in the final pipe, though, such as the https://github.com/Oqaasileriffik/katersat semantic tags module. I would prefer to get <sh> implemented in libdivvun so we can test the actual pipe, before needing to port all the Perl and Python parts to C++. But I guess my yule project could be to port it all.

snomos · 2024-12-18T21:02:26Z

What is the semantic tag module, and how is it used?

flammie · 2024-12-19T00:11:13Z

> I would prefer to get <sh> implemented in libdivvun so we can test the actual pipe,

I think I implemented most of it along the way some time ago, but it needs testing (and development), since it's the least portable and standardised code.

TinoDidriksen · 2024-12-19T12:46:37Z

> What is the semantic tag module, and how is it used?

Almost all of our semantic tags come from our online dictionary interface, Katersat, and not the FST. Katersat is easier for everyone to work with. Student helpers and others can easily be taught how to tag semantics and provide translations and explanations in Katersat, without needing to know how to change the FST. It's also vastly faster during development.

We then extract those semantic tags from Katersat and apply them to the output of the FST, so the pipe can make use of them for disambiguation.

TinoDidriksen · 2025-01-20T14:21:36Z

Added testing kalgram-full pipe (ping @Juutitta) in 32560d3 - it assumes our ~/langtech/ setup for now.

Still need to modify divvun-suggest to understand Prefix/* tags or prefixed tags, and port everything else to C++.

flammie · 2025-01-20T17:53:43Z

This is what I get on current libdivvun:

$ echo "taakku marluk inuunerminni taama pilluartigisimanngisaannarput" | modes/kalgram-full.mode 
dependency.cg3: Warning: Barriers only make sense for scanning or self tests on line 11238 at 1A (/"piareer"\ Gram/IV\ SAR\ Der/vv\ Gram/TV\ Gram/Refl\ V/l) + VFIN BARRIER KOMMA OR VFIN.
functions.cg3: Warning: Barriers only make sense for scanning or self tests on line 8501 at 1 (/"piareer"\ Gram/IV\ SAR\ Der/vv\ Gram/TV\ Gram/Refl\ V/l) + VFIN BARRIER KOMMA OR VFIN.
"<taakku>"
	"una" Prefix/TA Gram/Dem Pron Abs Pl <W:0.0> @Pron>N #1->2
: 
"<marluk>"
	"marluk" Num Abs Pl <W:0.0> @SUBJ> #2->5
: 
"<inuunerminni>"
	"inuk" U Der/nv Gram/IV NIQ Der/vn N Lok Sg 4PlPoss <W:0.0> @ADVL> #3->5
: 
"<taama>"
	"taama" Adv <W:0.0> @>V #4->5
: 
"<pilluartigisimanngisaannarput>"
	"pilluar" Gram/IV TIGE Der/vv Gram/IV SIMA Der/vv NNGISAANNAR Der/vv Gram/IV V Ind 3Pl <W:0.0> @PRED #5->0
:\n

I hacked the mode file to point to kal-prefix-propagate correctly, though, and commented out the Python script I was missing. If that's relevant for this issue I can check it out later.

TinoDidriksen · 2025-01-20T18:49:14Z

Yep, that's as it should be.

But the rest of libdivvun needs to handle Prefix/* and prefixed tags. E.g., a typo in a long prefixed word tamatumuunakkut => "manna" Prefix/TA Gram/Dem Sem/ac Pron Via Sg:

$ echo 'tamatumuunakkutt' | bash modes/kalgram-full.mode
"<tamatumuunakkutt>"
        "tamatumuunakkutt" ?

Which seems broken, as the unprefixed word matumuunakkut => "manna" Gram/Dem Pron Via Sg is handled fine when typoed to matumuunakkutt. So all CG parsers and generators need to move prefixes around. Luckily, that's simple enough to do.

flammie · 2025-01-21T01:25:37Z

Ah, I think I get it now: at least the grammar checker needs to restore the prefix tags to their original places before generating. The speller component probably does something with the tags too; I cannot remember exactly how it works.

flammie · 2025-01-21T11:45:11Z

I made cgspell part throw all tags before lemma to the other side with Prefix/ now.

snomos · 2025-01-21T12:27:04Z

> I made cgspell part throw all tags before lemma to the other side with Prefix/ now.

In what order? Same as before throw, or reversed, or undefined?

flammie · 2025-01-21T12:44:35Z

> > I made cgspell part throw all tags before lemma to the other side with Prefix/ now.

> In what order? Same as before throw, or reversed, or undefined?

They should come in the same order for now. I guess that up to the grammar correction generator it's all handled by VISL CG-3 rules, so it's order-agnostic.

snomos · 2025-01-21T12:58:38Z

CG is order agnostic, but the generator is not, so we need the order to be fixed. As long as we agree what that fixed order is there is no problem 🙂 Same order seems fine and logical.
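
The agreement reached here (prefix tags keep their relative order in both directions) can be stated as a round-trip property. The Python sketch below is only an illustration of that property, not code from cgspell or libdivvun. The helper names and the two placeholder prefix tags `P1` and `P2` are hypothetical.

```python
PREFIX = "Prefix/"

def fst_to_cg(before, lemma, after):
    """FST order (prefix tags, lemma, other tags) -> CG order (lemma first)."""
    return ['"%s"' % lemma] + [PREFIX + t for t in before] + list(after)

def cg_to_fst(tokens):
    """CG order -> FST order, restoring prefix tags in the order they appear."""
    lemma = tokens[0].strip('"')
    before = [t[len(PREFIX):] for t in tokens[1:] if t.startswith(PREFIX)]
    after = [t for t in tokens[1:] if not t.startswith(PREFIX)]
    return before, lemma, after

# Hypothetical analysis with two prefix tags, only to make the order visible.
analysis = (["P1", "P2"], "una", ["Gram/Dem", "Pron", "Abs", "Pl"])
cg = fst_to_cg(*analysis)
print(cg)

# With "same order" in both directions, the round trip is the identity.
assert cg_to_fst(cg) == analysis
```

If one side reversed the prefixes while the other did not, the final assertion would fail for any word with more than one prefix tag, which is why the order has to be fixed by convention.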

flammie · 2025-01-30T14:29:06Z

Is there an example of prefixed word generation issue for the grammar checker part? The spell-checker should work now.

Juutitta · 2025-01-30T15:51:33Z

> Is there an example of prefixed word generation issue for the grammar checker part? The spell-checker should work now.

Seems to be working:
tools/grammarcheckers - (main) > echo "taakku marluk inuunerminni taama pilluartigisimanngisaannarput" | modes/kalgram-full.mode

"<taakku>"
"una" Prefix/TA Gram/Dem Pron Abs Pl <W:0.0> @pron>N #1->2
"<marluk>"
"marluk" Sem/ac-sign Num Abs Pl <W:0.0> @subj> #2->5
"<inuunerminni>"
"inuk" U Der/nv Gram/IV NIQ Der/vn Sem/ac N Lok Sg 4PlPoss <W:0.0> @advl> #3->5
"<taama>"
"taama" Adv <W:0.0> @>V #4->5
"<pilluartigisimanngisaannarput>"
"pilluar" Gram/IV iSem/emote TIGE Der/vv Gram/IV SIMA Der/vv NNGISAANNAR Der/vv Gram/IV V Ind 3Pl <f:be_attribute_jpsych> <§TH_@SUBJ_N_Abs> <§TH@SUBJ-NULL__N_Pron_Prop> <W:0.0> @pred #5->0
:\n

This was before:
tools/grammarcheckers - (main) > echo "taakku marluk inuunerminni taama pilluartigisimanngisaannarput" | modes/kalgram.mode
Warning: No soft or hard delimiters defined in grammar. Hard limit of 500 cohorts may break windows in unintended places.
Warning: No soft or hard delimiters defined in grammar. Hard limit of 500 cohorts may break windows in unintended places.
"<taakku>"
:
"<marluk>"
"marluk" Num Abs Pl <W:0.0>
:
"<inuunerminni>"
"inuk" U Der/nv Gram/IV NIQ Der/vn N Lok Sg 4PlPoss <W:0.0>
:
"<taama>"
"taama" Adv <W:0.0>
:
"<pilluartigisimanngisaannarput>"
"pilluar" Gram/IV TIGE Der/vv SIMA Der/vv NNGISAANNAR Der/vv Gram/IV V Ind 3Pl <W:0.0>
:\n

snomos added the bug label · 2024-12-18
