# Grammar checker tokenisation for fkv

Requires a recent version of HFST (3.10.0 / git revision >= 3aecdbc). Then just:

```
$ make
$ echo "ja, ja" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
```

More usage examples:

$ echo "Juos gorreválggain lea (dárbbašlaš) deavdit gáibádusa boasttu olmmoš, man mielde lahtuid." | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "(gáfe) 'ja' ja 3. ja? ц jaja ukjend \"ukjend\"" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst
$ echo "márffibiillagáffe" | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst

Pmatch documentation: https://github.com/hfst/hfst/wiki/HfstPmatch

Characters which have analyses in the lexicon, but which can appear without spaces before or after, that is, with no context conditions, adjacent to words (a sketch follows the list):

  • Punct contains the ASCII punctuation marks
  • The symbol after the m-dash is the soft hyphen U+00AD
  • The symbol following {•} is the byte-order mark / zero-width no-break space U+FEFF.
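For illustration only, such characters can be collected in plain pmatch definitions along these lines; the names and the exact membership are assumptions, not the real definitions in this pmscript:

```
! Hedged sketch, not the actual pmscript contents.
! Quoted symbols are literal characters; the invisible ones
! (soft hyphen U+00AD, zero-width no-break space U+FEFF)
! likewise stand inside the quotes as literal characters.
Define Punct [ "!" | "," | "." | ":" | ";" | "?" | "(" | ")" | "-" ] ;
Define Hyph  [ "­" ] ;   ! soft hyphen U+00AD
Define Bom   [ "﻿" ] ;   ! zero-width no-break space / BOM U+FEFF
```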

Whitespace contains ASCII white space, and the List contains some Unicode white space characters (also sketched below):

  • En Quad U+2000 to Zero-Width Joiner U+200D
  • Narrow No-Break Space U+202F
  • Medium Mathematical Space U+205F
  • Word Joiner U+2060
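A corresponding sketch for the Unicode white space symbols, again with an assumed name and only part of the membership; the quoted symbols are the literal (mostly invisible) characters identified in the comments:

```
! Hedged sketch; quoted symbols are the literal Unicode characters.
Define UnicodeWhitespace [
      " "    ! En Quad U+2000 (the range runs up to Zero-Width Joiner U+200D)
    | " "    ! Narrow No-Break Space U+202F
    | " "    ! Medium Mathematical Space U+205F
    | "⁠"    ! Word Joiner U+2060
] ;
```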

Apart from what's in our morphology, there are

  1. unknown word-like forms, and
  2. unmatched strings.

We want to give 1) a match, but let 2) be treated specially by hfst-tokenise -a. To that end we also select:

  • extended Latin symbols
  • symbols
  • various symbols from the Private Use Area (probably Microsoft), so far:
      • U+F0B7 for "x in box"

TODO: Could use something like this, but the built-ins don't include šžđčŋ.

Simply give an empty reading when something is unknown: hfst-tokenise --giella-cg will treat such empty analyses as unknowns, and remove empty analyses from other readings. Empty readings are also legal in CG; they get a default baseform equal to the wordform but no tag to check, so it's safer to let hfst-tokenise handle them.
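With the empty-reading approach, an unknown but word-like form still comes out of hfst-tokenise --giella-cg as an ordinary cohort, roughly like the sample below; the exact rendering of unknowns (here a bare "?" reading) is an assumption and may vary between versions:

```
"<jaja>"
	"jaja" ?
```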

Finally, we mark as a token any sequence making up one of the following (a minimal top-level sketch follows the list):

  • known word in context
  • unknown (OOV) token in context
  • sequence of word and punctuation
  • URL in context
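A minimal sketch of what such a top-level pattern can look like in a pmatch script; the file names and definition names are assumptions, not the actual contents of this pmscript:

```
! Hedged sketch of a top-level tokenising pattern (names assumed).
Define morphology @bin"analyser.hfst" ;   ! analyses for known (and word-like) tokens
Define url        @bin"url.hfst" ;        ! URL recogniser
Define TOP [ morphology | url ] ;         ! anything else is left to hfst-tokenise
```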

This (part of) documentation was generated from `tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript`.