kgrams_counts bug #5

Open
wzy816 opened this issue May 3, 2019 · 1 comment

wzy816 commented May 3, 2019

Maybe it is trivial and I am wrong.

From the paper, I think the count of a k-gram "word" is the number of times it occurs in the corpus data, not the number of higher-order gram types it appears in. If this is the case,

new_order[suffix] += 1
should be changed to

new_order[suffix] += last_order[ngram]

But even this is troublesome. For example, ('?', '', '') in last_order will add its suffix ('', '') to new_order, but I don't think a bigram made of two pad symbols is valid in a bigram model.

Therefore, I think a better way to do the k-gram counts is to compute each order independently and directly from the corpus data.
The KneserNeyLM class definition, which takes the highest-order n-grams (ngrams) as its argument, and example.py, which uses gut_ngrams, would need to be revised as well.
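
To make the proposal concrete, here is a minimal sketch of counting each order independently from the corpus. It is not the repository's code: the function name count_kgrams_from_corpus and the toy sentences are hypothetical, and padding is deliberately left out so that no k-gram made only of pad symbols can appear.

```python
from collections import Counter

def count_kgrams_from_corpus(sentences, k):
    """Count raw k-gram occurrences directly from tokenized sentences.

    Hypothetical helper for illustration; no pad symbols are added.
    """
    counts = Counter()
    for tokens in sentences:
        # Slide a window of width k over each sentence.
        for i in range(len(tokens) - k + 1):
            counts[tuple(tokens[i:i + k])] += 1
    return counts

# Toy corpus (made up for this example).
sentences = [["a", "b", "a", "b"], ["a", "c"]]

# One independent table per order, each built from the corpus itself.
raw_counts = {k: count_kgrams_from_corpus(sentences, k) for k in (1, 2, 3)}
```

Each order is then a plain frequency table, so the bigram table never depends on how the trigrams were padded.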

@robinn37

For the lower orders, the count here is the adjusted count, i.e. the number of unique prefixes a k-gram follows, not its raw corpus frequency. So I think += 1 is correct here.
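
The difference between the two update rules can be seen on a toy example. This is only a sketch under assumptions: lower_order_counts and its weighted flag are hypothetical, the trigram counts are made up, and only the names last_order, new_order, suffix and ngram come from the snippet quoted in the issue.

```python
from collections import Counter

def lower_order_counts(last_order, weighted=False):
    """Derive (k-1)-gram counts from a table of k-gram counts.

    weighted=False: new_order[suffix] += 1 (one per distinct k-gram type).
    weighted=True:  new_order[suffix] += last_order[ngram] (summed frequency).
    Hypothetical helper for illustration only.
    """
    new_order = Counter()
    for ngram, count in last_order.items():
        suffix = ngram[1:]  # drop the first word, i.e. the prefix
        new_order[suffix] += count if weighted else 1
    return new_order

# Made-up trigram counts.
last_order = Counter({
    ("a", "b", "c"): 3,
    ("d", "b", "c"): 1,
    ("a", "b", "d"): 2,
})

adjusted = lower_order_counts(last_order)               # += 1
summed = lower_order_counts(last_order, weighted=True)  # += last_order[ngram]
```

Here ("b", "c") follows two distinct prefixes, "a" and "d", so its adjusted count is 2, while the summed variant gives 3 + 1 = 4.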
