Phone number analyser for all languages #2

Open
ilm024 opened this issue Sep 10, 2024 · 4 comments

ilm024 commented Sep 10, 2024

We are missing a phone number analyser for all languages, either in shared-smi or in shared-mul.

This is what the analysis currently looks like for Lule Sami. Swedish phone numbers are particularly challenging there: because they begin with 0, they end up flagged as "typos":

"<tel.>"
        "tel" N <smj> <smj> Sem/Obj-el ABBR Gram/TNumAbbr Sg Gen <W:0.0> SELECT:3805 SUBSTITUTE:4355 SUBSTITUTE:4354
;       "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Attr <W:0.0> SELECT:3805
;       "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Acc <W:0.0> SELECT:3805
;       "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Nom <W:0.0> REMOVE:3661
;       "." CLB <W:0.0> "<.>"
;               "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Attr <W:0.0> "<tel>" REMOVE:2110:longest-match
;       "." CLB <W:0.0> "<.>"
;               "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Acc <W:0.0> "<tel>" REMOVE:2110:longest-match
;       "." CLB <W:0.0> "<.>"
;               "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Gen <W:0.0> "<tel>" REMOVE:2110:longest-match
;       "." CLB <W:0.0> "<.>"
;               "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Nom <W:0.0> "<tel>" REMOVE:2110:longest-match
;       "." CLB <W:0.0> "<.>"
;               "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Pl Nom <W:0.0> "<tel>" REMOVE:2110:longest-match
: 
"<073-786>"             073-786 →  -73-786      →  73-786
        "-73-786" Num Arab Sg Nom <W:32.5909> <WA:22.5909> <spelled> "-73-786"S PROTECT:1251 SELECT:1301 ADD:6:spelled SELECT:1455 &SUGGESTWF &typo
        "73-786" Num Arab Sg Nom <W:32.5909> <WA:22.5909> <spelled> "73-786"S PROTECT:1251 SELECT:1301 ADD:6:spelled SELECT:1455 &SUGGESTWF &typo
;       "-73-786" Num Arab Sg Ela Attr <W:32.5909> <WA:22.5909> <spelled> "-73-786"S PROTECT:1251 SELECT:1301 &SUGGESTWF &typo ADD:6:spelled SELECT:1455
;       "-73-786" Num Arab Sg Gen <W:32.5909> <WA:22.5909> <spelled> "-73-786"S PROTECT:1251 SELECT:1301 &SUGGESTWF &typo ADD:6:spelled SELECT:1455
;       "-73-786" Num Arab Sg Ine Attr <W:32.5909> <WA:22.5909> <spelled> "-73-786"S PROTECT:1251 SELECT:1301 &SUGGESTWF &typo ADD:6:spelled SELECT:1455
;       "-73-786" Num Arab Sg Ill Attr <W:32.5909> <WA:22.5909> <spelled> "-73-786"S PROTECT:1251 SELECT:1301 &SUGGESTWF &typo ADD:6:spelled SELECT:1455
;       "73-786" Num Arab Sg Ela Attr <W:32.5909> <WA:22.5909> <spelled> "73-786"S PROTECT:1251 SELECT:1301 &SUGGESTWF &typo ADD:6:spelled SELECT:1455
;       "73-786" Num Arab Sg Gen <W:32.5909> <WA:22.5909> <spelled> "73-786"S PROTECT:1251 SELECT:1301 &SUGGESTWF &typo ADD:6:spelled SELECT:1455
;       "73-786" Num Arab Sg Ine Attr <W:32.5909> <WA:22.5909> <spelled> "73-786"S PROTECT:1251 SELECT:1301 &SUGGESTWF &typo ADD:6:spelled SELECT:1455
;       "73-786" Num Arab Sg Ill Attr <W:32.5909> <WA:22.5909> <spelled> "73-786"S PROTECT:1251 SELECT:1301 &SUGGESTWF &typo ADD:6:spelled SELECT:1455
;       "073-786" ? SELECT:1301
: 
"<58>"
        "58" Num Arab Sg Nom <W:0.0> SELECT:1454:Arab SELECT:1456
;       "58" Num Arab Sg Ela Attr <W:0.0> SELECT:1454:Arab SELECT:1456
;       "58" Num Arab Sg Gen <W:0.0> SELECT:1454:Arab SELECT:1456
;       "58" Num Arab Sg Ill Attr <W:0.0> SELECT:1454:Arab SELECT:1456
;       "58" Num Arab Sg Ine Attr <W:0.0> SELECT:1454:Arab SELECT:1456
;       "58" Num Sem/ID <W:0.0> SELECT:1454:Arab
;       "58" A Arab Ord Attr CLBfinal <W:0.0> REMOVE:2067:spurious-adj-reading
: 
"<10.>"
        "10" A <smj> <smj> Arab Ord Attr <W:0.0> SUBSTITUTE:4354 SUBSTITUTE:4353

snomos commented Sep 10, 2024

The first part of the phone number is simply not recognised by the analyser, so the spell checker is used to generate "suggested corrections"; cf. the <spelled> tag.

flammie commented Sep 10, 2024

Technically it is fairly easy to build a lexicon or regular expressions for the phone number formats. The main problem has been that anything placed in shared turns out to be problematic for one use case or another. For example, there is already a commented-out phone number lexicon in shared-smi: https://github.com/giellalt/shared-smi/blob/main/src/fst/stems/arabic_roman_digits.lexc#L354-L368 (it is too old for me to find out who commented it out, but maybe someone knows the background?)
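To make the "regular expressions for the phone number formats" idea concrete, here is a minimal Python sketch. The pattern is an assumption made up for illustration: it only covers numbers shaped like the Swedish example in this issue (leading 0, a hyphen, then space-separated digit groups). It is not the pattern in shared-smi or shared-mul, and `PHONE_RE` and `find_phone_numbers` are hypothetical names.

```python
import re

# Hypothetical pattern, NOT taken from shared-smi or shared-mul.
# Assumed shape: a leading 0, one to three more digits, a hyphen,
# then two to four digit groups separated by single spaces.
PHONE_RE = re.compile(r"\b0\d{1,3}-\d{2,3}(?: \d{2,3}){1,3}\b")


def find_phone_numbers(text):
    """Return every substring of text that matches the assumed format."""
    return PHONE_RE.findall(text)


print(find_phone_numbers("tel. 073-786 58 10"))
```

The point of matching the whole number, spaces included, is that it can then be handled as a single token rather than as three separate numerals.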

snomos commented Sep 11, 2024

> Technically it is fairly easy to build a lexicon or regular expressions for the phone number formats. The main problem has been that anything placed in shared turns out to be problematic for one use case or another.

Just ignore old, commented-out things. We need a shared phone number parser, so it would be great if you could add one to shared-mul.

The phone numbers must of course also be tagged so that it is easy to disambiguate them, or to remove them from the FST entirely.

flammie commented Sep 11, 2024

It is in shared-mul and lang-smj now:

$ echo tel. 073-786 58 10 | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst 
"<tel.>"
	"." CLB <W:0.0> "<.>"
		"tel" N Sem/Obj-el ABBR Gram/TNumAbbr Attr <W:0.0> "<tel>"
	"tel" N Sem/Obj-el ABBR Gram/TNumAbbr Attr <W:0.0>
	"." CLB <W:0.0> "<.>"
		"tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Acc <W:0.0> "<tel>"
	"tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Acc <W:0.0>
	"." CLB <W:0.0> "<.>"
		"tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Gen <W:0.0> "<tel>"
	"tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Gen <W:0.0>
	"." CLB <W:0.0> "<.>"
		"tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Nom <W:0.0> "<tel>"
	"tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Nom <W:0.0>
	"." CLB <W:0.0> "<.>"
		"tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Nom <W:0.0> "<tel>"
	"." CLB <W:0.0> "<.>"
		"tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Gen <W:0.0> "<tel>"
	"." CLB <W:0.0> "<.>"
		"tel" N Sem/Obj-el ABBR Gram/TNumAbbr Pl Nom <W:0.0> "<tel>"
	"." CLB <W:0.0> "<.>"
		"tel" N Sem/Obj-el ABBR Gram/TNumAbbr Attr <W:0.0> "<tel>"
: 
"<073-786 58 10>"
	"073-786 58 10" Num Arab TEL <W:0.0>
:\n
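
With the whole number analysed as a single cohort carrying the `TEL` tag, downstream tools can pick it out easily. The following is a rough Python sketch of that idea, assuming the CG stream layout shown above (wordform lines of the form `"<...>"`, indented reading lines). The function name `cohorts_with_tag` is hypothetical, and in a real pipeline this selection would more likely be done with CG rules than with a script.

```python
def cohorts_with_tag(cg_stream, tag):
    """Return wordforms of cohorts that have at least one reading with `tag`."""
    hits = []
    wordform = None
    for line in cg_stream.splitlines():
        stripped = line.rstrip()
        if stripped.startswith('"<') and stripped.endswith('>"'):
            # Cohort header, e.g. "<073-786 58 10>"
            wordform = stripped[2:-2]
        elif line[:1] in (" ", "\t") and wordform is not None:
            # Reading line: a quoted lemma (which may contain spaces), then tags.
            reading = line.strip()
            lemma_end = reading.find('"', 1)
            tags = reading[lemma_end + 1:].split()
            if tag in tags and wordform not in hits:
                hits.append(wordform)
    return hits


sample = "\n".join([
    '"<tel.>"',
    '\t"tel" N Sem/Obj-el ABBR Gram/TNumAbbr Attr <W:0.0>',
    ": ",
    '"<073-786 58 10>"',
    '\t"073-786 58 10" Num Arab TEL <W:0.0>',
])
print(cohorts_with_tag(sample, "TEL"))
```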
