Full stops after numbers unnoticed, extra ones predicted #9
Labels: enhancement, good first issue, help wanted
Hi and thanks a lot for the great tool!
It seems that the original punctuation-removal step intentionally keeps punctuation in numbers, perhaps because of decimal points or the way ordinal numbers are written in some languages.
This, however, results in extra punctuation being predicted when a number is at the end of a sentence: 'The Answer to the Ultimate Question of Life, the Universe, and Everything is 42.' becomes 'The Answer to the Ultimate Question of Life, the Universe, and Everything is 42..'
I'm not sure what an elegant solution would be. The punctuation-stripping regex can't tell ordinal marks apart from sentence-final full stops. It would be nice to trust the LM to predict all the punctuation, i.e., to remove all of it in the pre-processing step.
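To make the problem concrete, here is a minimal sketch in Python. Both regexes and both function names are hypothetical stand-ins written for this issue, not the tool's actual code: the first assumes the stripping step skips any mark that directly follows a digit, and the second is one possible middle ground that keeps a mark only when it sits between two digits.

```python
import re

# Hypothetical stand-in for the current stripping step (an assumption, not
# the tool's real regex): remove punctuation unless it directly follows a digit.
KEEP_AFTER_DIGIT = re.compile(r"(?<!\d)[.,;:!?]")

# Hypothetical alternative: remove punctuation unless it sits between two
# digits, so "3.14" and "1,000" survive but a sentence-final "42." does not.
STRIP_UNLESS_BETWEEN_DIGITS = re.compile(r"(?<!\d)[.,;:!?]|[.,;:!?](?!\d)")


def strip_keep_after_digit(text: str) -> str:
    """Strip punctuation, keeping any mark that follows a digit."""
    return KEEP_AFTER_DIGIT.sub("", text)


def strip_unless_between_digits(text: str) -> str:
    """Strip punctuation, keeping only marks with a digit on both sides."""
    return STRIP_UNLESS_BETWEEN_DIGITS.sub("", text)


sentence = ("The Answer to the Ultimate Question of Life, "
            "the Universe, and Everything is 42.")

# The full stop after "42" survives, so a predicted one would be doubled.
print(strip_keep_after_digit(sentence))

# The full stop after "42" is removed; decimal points are still kept.
print(strip_unless_between_digits(sentence))
print(strip_unless_between_digits("Pi is about 3.14."))
```

The second variant still strips ordinal marks (for example, the dot in German "3. Mai"), so it trades one error for another; whether that is acceptable depends on how well the model restores those marks.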