Hyphenating texts using the API server

For the time being, only North Sámi can be hyphenated as shown below. For languages other than North Sámi, see this document. We hope to add support for more languages soon.

Basic command

curl -s -X POST -H 'Content-Type: application/json' \
     -i 'https://api-giellalt.uit.no/hyphenation/hyphenator-gt-desc' \
     --data '{"text": "mun hálan davvisámegiela"}' |\
     grep '{' |\
     jq '.' 

Comments:

  • we use curl to access the REST API, with the -s parameter to suppress the progress meter and error messages.
  • --data contains the actual text to be hyphenated. It can be long, but is best limited to a single paragraph per request to keep the execution time down.
  • grep '{' discards the HTTP response headers that -i adds to the output, keeping only the JSON body
  • jq '.' pretty-prints the output

Output:

{
  "text": "mun hálan davvisámegiela",
  "results": [
    {
      "word": "mun",
      "hyphenations": [
        {
          "value": "mun",
          "weight": 0.0
        },
        {
          "value": "mun",
          "weight": 5000.0
        }
      ]
    },
    {
      "word": "hálan",
      "hyphenations": [
        {
          "value": "há^lan",
          "weight": 0.0
        },
        {
          "value": "há^lan",
          "weight": 5000.0
        }
      ]
    },
    {
      "word": "davvisámegiela",
      "hyphenations": [
        {
          "value": "dav^vi#sá^me#gie^la",
          "weight": 0.0
        },
        {
          "value": "dav^vi^sá^me^gie^la",
          "weight": 5000.0
        }
      ]
    }
  ]
}

This is the raw output from the API server. Comments on the output:

  • both the input text and the output data are listed
  • hyphenation points are indicated with two symbols:
    • #: primary hyphenation point (usually a word boundary)
    • ^: secondary hyphenation point
  • for each input word, all hyphenation patterns are listed, from best to worst
  • the weight is a very rough indication of priority, with 0.0 being the best
  • there will most often be at least two hyphenation patterns, one from the lexical lookup (those with weight 0.0), and one from the pattern-based fallback (weight 5000.0 or higher). For unrecognised misspellings or unknown words, only the pattern-based fallback is provided.
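
The weight convention described above can be used to pick out the words the hyphenator did not recognise. The following is a minimal sketch, not part of the API itself: the function name flag_unknown and the variable response are hypothetical, the sample response is abbreviated and modelled on the examples in this document, and the filter assumes that lexical results always have weight 0.0.

```shell
# Hypothetical helper: list the words for which the response contains
# no lexical (weight 0.0) hyphenation, i.e. only the pattern-based fallback.
# Reads an API response on stdin.
flag_unknown() {
    jq -r '.results[]
           | select(all(.hyphenations[]; .weight > 0))
           | .word'
}

# Abbreviated sample response, modelled on the output format shown above.
response='{"text":"mun davvisámegiellla","results":[
  {"word":"mun","hyphenations":[{"value":"mun","weight":0.0},{"value":"mun","weight":5000.0}]},
  {"word":"davvisámegiellla","hyphenations":[{"value":"dav^vi^sá^me^giell^la","weight":5000.0}]}]}'

printf '%s\n' "$response" | flag_unknown
```

With the sample above, only the misspelled word is printed. Against the live server, the curl pipeline from the basic command (up to and including grep) would be piped into flag_unknown instead of the sample.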

Filtered examples

curl -s -X POST -H 'Content-Type: application/json' \
    -i 'https://api-giellalt.uit.no/hyphenation/hyphenator-gt-desc' \
    --data '{"text": "mun hálan davvisámegiela"}' |\
    grep '{' |\
    jq '.results[].hyphenations | map(select(.value)) | first'

Comment:

  • the jq filter keeps only the first (most likely) hyphenation pattern for each word, together with its weight

Output:

{
  "value": "mun",
  "weight": 0.0
}
{
  "value": "há^lan",
  "weight": 0.0
}
{
  "value": "dav^vi#sá^me#gie^la",
  "weight": 0.0
}
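
The filtered output above no longer shows which input word each pattern belongs to. As a sketch of one way to keep both, the hypothetical helper best_pairs below prints each word next to its first hyphenation. It relies on the patterns being listed from best to worst, and the sample response is an abbreviated copy of the output shown earlier.

```shell
# Hypothetical helper: print each input word and its best-ranked
# hyphenation, separated by a tab. Reads an API response on stdin.
best_pairs() {
    jq -r '.results[] | [.word, .hyphenations[0].value] | @tsv'
}

# Abbreviated sample response, copied from the output format shown above.
response='{"text":"mun hálan","results":[
  {"word":"mun","hyphenations":[{"value":"mun","weight":0.0},{"value":"mun","weight":5000.0}]},
  {"word":"hálan","hyphenations":[{"value":"há^lan","weight":0.0},{"value":"há^lan","weight":5000.0}]}]}'

printf '%s\n' "$response" | best_pairs
```

The tab-separated form is easy to feed into cut, awk or a spreadsheet.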

The same example, but now with a misspelling; notice the change in weight for the last word:

curl -s -X POST -H 'Content-Type: application/json' \
    -i 'https://api-giellalt.uit.no/hyphenation/hyphenator-gt-desc' \
    --data '{"text": "mun hálan davvisámegiellla"}' |\
    grep '{' |\
    jq '.results[].hyphenations | map(select(.value)) | first'

Output:

{
  "value": "mun",
  "weight": 0.0
}
{
  "value": "há^lan",
  "weight": 0.0
}
{
  "value": "dav^vi^sá^me^giell^la",
  "weight": 5000.0
}

If you only want the hyphenated words, and not the surrounding JSON structure, use the following jq filter:

curl -s -X POST -H 'Content-Type: application/json' \
    -i 'https://api-giellalt.uit.no/hyphenation/hyphenator-gt-desc' \
    --data '{"text": "mun hálan davvisámegiela"}' |\
    grep '{' |\
    jq '.results[].hyphenations | map(select(.value).value) | first'

Output:

"mun"
"há^lan"
"dav^vi#sá^me#gie^la"

Add -r/--raw-output to jq if you want to get rid of the quotes:

curl -s -X POST -H 'Content-Type: application/json' \
    -i 'https://api-giellalt.uit.no/hyphenation/hyphenator-gt-desc' \
    --data '{"text": "mun hálan davvisámegiela"}' |\
    grep '{' |\
    jq -r '.results[].hyphenations | map(select(.value).value) | first'

Output:

mun
há^lan
dav^vi#sá^me#gie^la
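
The raw output can be post-processed further with ordinary text tools. As an illustration only, the hypothetical helper show_hyphens below turns both marker symbols into a visible hyphen. It treats primary and secondary points alike, and it assumes the input words contain no # or ^ characters of their own.

```shell
# Hypothetical helper: replace both hyphenation markers (# and ^)
# with a visible hyphen. Reads hyphenated words on stdin, one per line.
show_hyphens() {
    sed 's/[#^]/-/g'
}

# Sample input: the raw output shown above.
printf '%s\n' 'mun' 'há^lan' 'dav^vi#sá^me#gie^la' | show_hyphens
# prints:
# mun
# há-lan
# dav-vi-sá-me-gie-la
```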

If you have a text file that you would like to have hyphenated, do as follows:

cat textfile.txt |\
    (printf '{"text": "' && cat && printf '"}') |\
    curl -s -X POST -H 'Content-Type: application/json' \
    -i 'https://api-giellalt.uit.no/hyphenation/hyphenator-gt-desc' \
    --data @- |\
    grep '{' |\
    jq '.results[].hyphenations | map(select(.value).value) | first'

Comments:

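  • the printf commands after the initial cat wrap the file content in a simple JSON structure, as that is what is expected on the other end. Nothing is escaped, so this only works if the text contains no double quotes or backslashes.
  • curl's --data strips newlines when reading from @-, so the lines of a multi-line file are joined without a separator; single-paragraph files on one line are the safest input.
  • add -r/--raw-output to jq if you want to get rid of the quotes (see above)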

Output (assuming the textfile.txt file has the same content as the example sentence used above):

"mun"
"há^lan"
"dav^vi#sá^me#gie^la"
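
The printf wrapping above inserts the file content verbatim, so a double quote or backslash in the text produces invalid JSON. A possible alternative, sketched here under the assumption that jq is already installed for the filtering step, is to let jq build the request body. The function name wrap_text is hypothetical.

```shell
# Hypothetical helper: wrap stdin in {"text": ...} and let jq do the
# JSON escaping. -R reads raw text, -s slurps all of it into one string,
# -c prints the object on a single line. One trailing newline is removed.
wrap_text() {
    jq -Rsc '{text: rtrimstr("\n")}'
}

# Sample input containing double quotes, which the printf wrapping cannot handle.
printf 'mun hálan "davvisámegiela"\n' | wrap_text
# prints: {"text":"mun hálan \"davvisámegiela\""}

# Usage against the server (not run here), same pipeline as above:
#   wrap_text < textfile.txt |
#       curl -s -X POST -H 'Content-Type: application/json' \
#       -i 'https://api-giellalt.uit.no/hyphenation/hyphenator-gt-desc' \
#       --data @- | grep '{' | jq '.'
```

Line breaks inside a multi-line file are kept as \n escapes in the request; how the server treats them is not covered by this document.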