kgrams_counts bug #5

wzy816 · 2019-05-03T09:22:13Z

Maybe it is trivial and I am wrong.

From the paper, I think the count of a k-gram "word" is its number of occurrences in the corpus data, not the number of higher-order gram types it appears in. If this is the case,

kneser-ney/kneser_ney.py

Line 58 in 2740fba

new_order[suffix] += 1

should be changed to

new_order[suffix] += last_order[ngram]

But even this is troublesome. For example, ('?', '', '') in last_order will add its suffix ('', '') to new_order, but I think two pad symbols in a row are not valid in a bigram model.

Therefore, I think a better way to do the k-gram counts is to compute each order independently and directly from the corpus data.
The KneserNeyLM class definition, which takes the highest-order ngrams as its argument, and example.py, which uses gut_ngrams, would need to be revised as well.
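
A minimal sketch of what counting each order independently could look like, assuming the corpus is available as a list of tokenized sentences. The function name `count_kgrams_per_order` and the pad symbols `<s>` and `</s>` are hypothetical choices for illustration and are not taken from kneser_ney.py.

```python
from collections import Counter


def count_kgrams_per_order(sentences, highest_order,
                           left_pad="<s>", right_pad="</s>"):
    """Return a list of Counters; index k-1 holds the raw k-gram counts.

    Hypothetical helper: each order is counted straight from the corpus,
    with padding that depends on the order being counted.
    """
    counts = []
    for k in range(1, highest_order + 1):
        order_counts = Counter()
        for sent in sentences:
            # An order-k model gets k-1 pad symbols on each side, so a
            # bigram such as (pad, pad) is never produced.
            padded = [left_pad] * (k - 1) + list(sent) + [right_pad] * (k - 1)
            for i in range(len(padded) - k + 1):
                order_counts[tuple(padded[i:i + k])] += 1
        counts.append(order_counts)
    return counts


sentences = [["a", "b"], ["a", "b"]]
counts = count_kgrams_per_order(sentences, highest_order=3)
```

With this padding scheme the trigram counter still contains entries ending in two pad symbols, but the bigram counter does not, which is the case described above.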

robinn37 · 2019-06-26T09:43:45Z

For the lower orders, the value here is the adjusted count, i.e. the number of unique prefixes, not the raw corpus count. I think +1 is correct here.
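
To make the difference between the two updates concrete, here is a small sketch that applies both to the same toy trigram counts. The function `lower_order_counts` and its `use_raw_counts` flag are hypothetical and only mirror the loop discussed in this issue; they are not the repository's code.

```python
from collections import Counter


def lower_order_counts(last_order, use_raw_counts=False):
    """Derive (k-1)-gram counts from k-gram counts via each k-gram's suffix."""
    new_order = Counter()
    for ngram, count in last_order.items():
        suffix = ngram[1:]
        # += 1 counts distinct prefixes (the adjusted count);
        # += count sums the occurrences of the higher-order grams instead.
        new_order[suffix] += count if use_raw_counts else 1
    return new_order


trigrams = Counter({
    ("a", "b", "c"): 3,
    ("x", "b", "c"): 1,
    ("a", "b", "d"): 2,
})
adjusted = lower_order_counts(trigrams)
raw = lower_order_counts(trigrams, use_raw_counts=True)
```

Here `adjusted[("b", "c")]` is 2, because two distinct words precede "b c", while `raw[("b", "c")]` is 4, the total number of occurrences.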
