Full stops after numbers unnoticed, extra ones predicted #9

alexdiment · 2022-12-15T13:19:16Z

Hi and thanks a lot for the great tool!

It seems that the original punctuation removal step intentionally keeps punctuation inside numbers, perhaps because of decimal points or the way some languages write ordinal numbers.

This, however, results in extra punctuation being predicted when a number is at the end of a sentence: 'The Answer to the Ultimate Question of Life, the Universe, and Everything is 42.' becomes 'The Answer to the Ultimate Question of Life, the Universe, and Everything is 42..'

I'm not sure what an elegant solution would be. The punctuation-stripping regex can't tell ordinal marks apart from sentence-final full stops. It would be nice to trust the LM to predict all the punctuation, i.e., to remove all of it in the pre-processing step.
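To make the behaviour described above concrete, here is a minimal sketch. The pattern is an assumption about what a "keep punctuation next to digits" regex might look like, and `strip_punctuation` is a hypothetical name, not necessarily the tool's own code.

```python
import re

# Assumed pattern: remove a punctuation mark only if it has no digit
# directly before it and no digit directly after it.
KEEP_NUMERIC = re.compile(r"(?<!\d)[.,;:!?](?!\d)")


def strip_punctuation(text: str) -> str:
    """Hypothetical pre-processing step that spares punctuation touching digits."""
    return KEEP_NUMERIC.sub("", text)


sentence = (
    "The Answer to the Ultimate Question of Life, the Universe, "
    "and Everything is 42."
)
stripped = strip_punctuation(sentence)
# The commas are removed, but the full stop after "42" survives because a
# digit precedes it, so a model that then predicts "." yields "42..".
print(stripped)
```

With a pattern like this, a decimal point ("1.5"), an ordinal mark ("3.") and a sentence-final full stop after a number all look the same to the regex.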

oliverguhr · 2023-03-10T09:23:54Z

Good catch @alexdiment.

The issue is that the model cannot tell whether 123 should be 1.23, 12.3 or 123. I wanted to avoid the case where the model messes with decimal points.
I would suggest a post-processing step that ignores punctuation markers from the model if they are already present in the text.
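A rough sketch of what such a post-processing step could look like, for anyone who wants to pick this up. The input format (a list of `(word, mark)` pairs, with an empty string meaning "no punctuation predicted") and the name `merge_predictions` are assumptions made for illustration, not the library's actual API.

```python
# Marks that count as "already present" at the end of a word (assumed set).
PUNCTUATION = ".,;:!?"


def merge_predictions(labeled_words):
    """Join words with their predicted marks.

    A predicted mark is ignored when the word already ends in punctuation
    that survived pre-processing, e.g. the full stop in "42.".
    """
    merged = []
    for word, mark in labeled_words:
        if mark and word and word[-1] not in PUNCTUATION:
            word += mark
        merged.append(word)
    return " ".join(merged)


# Hypothetical model output for the end of the example sentence.
predictions = [("Everything", ""), ("is", ""), ("42.", ".")]
print(merge_predictions(predictions))
```

Note that this variant always keeps the mark already in the text, even when the model predicts a different one; whether the model's mark should win in that case is a separate design decision.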

It's a rather small improvement, but I have no time to implement it any time soon. So if someone could help out, it would be greatly appreciated.

oliverguhr added the enhancement and help wanted labels Mar 10, 2023

oliverguhr added the good first issue label Mar 10, 2023
