Stricter parsing, support more phrases, use doctest #7

Open. Wants to merge 1 commit into base: master.

362 changes: 337 additions & 25 deletions text2num.py
@@ -22,10 +22,252 @@
 # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 # THE SOFTWARE.
+"""
+Convert textual numbers written in English into their integer representations.
+
+>>> text2num("zero")
+0
+
+>>> text2num("one")
+1
+
+>>> text2num("twelve")
+12
+
+>>> text2num("nineteen")
+19
+
+>>> text2num("twenty nine")
+29
+
+>>> text2num("seventy two")
+72
+
+>>> text2num("three hundred")
+300
+
+>>> text2num("twelve hundred")
+1200
+
+>>> text2num("nineteen hundred eighty four")
+1984
+
+
+Hundreds may be implied without a 'hundreds' token if no other magnitudes
+are present:
+
+>>> text2num("one thirty")
+130
+
+>>> text2num("six sixty two")
+662
+
+>>> text2num("ten twelve")
+1012
+
+>>> text2num("nineteen ten")
+1910
+
+>>> text2num("nineteen eighty four")
+1984
+
+>>> text2num("twenty ten")
+2010
+
+>>> text2num("twenty twenty")
+2020
+
+>>> text2num("twenty twenty one")
+2021
+
+>>> text2num("fifty sixty three")
+5063
+
+>>> text2num("one thirty thousand")
+Traceback (most recent call last):
+...
+NumberException: 'thousand' may not proceed implied hundred 'one thirty'
+
+>>> text2num("nineteen eighty thousand")
+Traceback (most recent call last):
+...
+NumberException: 'thousand' may not proceed implied hundred
+'nineteen eighty'
+
+>>> text2num("twelve thousand three hundred four")
+12304
+
+>>> text2num("six million")
+6000000
+
+>>> text2num("six million four hundred thousand five")
+6400005
+
+>>> text2num("one hundred twenty three billion four hundred fifty six "
+... "million seven hundred eighty nine thousand twelve")
+123456789012
+
+>>> text2num("four decillion")
+4000000000000000000000000000000000
+
+>>> text2num("one hundred thousand")
+100000
+
+>>> text2num("one hundred two thousand")
+102000
+
+
+Magnitudes must magnify a number and appear in descending order (except
+for hundreds, since it can magnify other magnitudes).
+
+>>> text2num("thousand")
+Traceback (most recent call last):
+...
+NumberException: magnitude 'thousand' must be preceded by a number
+
+>>> text2num("hundred one")
+Traceback (most recent call last):
+...
+NumberException: magnitude 'hundred' must be preceded by a number
+
+>>> text2num("one thousand thousand")
+Traceback (most recent call last):
+...
+NumberException: magnitude 'thousand' must be preceded by a number
+
+>>> text2num("one thousand two thousand")
+Traceback (most recent call last):
+...
+NumberException: magnitude 'thousand' appeared out of order following
+'one thousand two'
+
+>>> text2num("one hundred two hundred")
+Traceback (most recent call last):
+...
+NumberException: magnitude 'hundred' appeared out of order following
+'one hundred two'
+
+>>> text2num("one thousand two million")
+Traceback (most recent call last):
+...
+NumberException: magnitude 'million' appeared out of order following
+'one thousand two'
+
+>>> text2num("nine one")
+Traceback (most recent call last):
+...
+NumberException: 'one' may not proceed 'nine'
+
+>>> text2num("ten two")
+Traceback (most recent call last):
+...
+NumberException: 'two' may not proceed 'ten'
+
+>>> text2num("nineteen nine")
+Traceback (most recent call last):
+...
+NumberException: 'nine' may not proceed 'nineteen'
+
+>>> text2num("sixty five hundred")
+6500
+
+>>> text2num("sixty hundred")
+6000
+
+>>> text2num("ten hundred twelve")
+1012
+
+>>> text2num("twenty twenty ten")
+Traceback (most recent call last):
+...
+NumberException: 'ten' may not proceed 'twenty' following 'twenty'
+
+>>> text2num("three thousand nineteen eighty four")
+Traceback (most recent call last):
+...
+NumberException: 'eighty' may not proceed 'nineteen' following
+'three thousand'
+
+>>> text2num("three million nineteen eighty four")
+Traceback (most recent call last):
+...
+NumberException: 'eighty' may not proceed 'nineteen' following
+'three million'
+
+>>> text2num("one million eighty eighty")
+Traceback (most recent call last):
+...
+NumberException: 'eighty' may not proceed 'eighty' following 'one million'
+
+>>> text2num("one million eighty one")
+1000081
+
+>>> text2num("zero zero")
+Traceback (most recent call last):
+...
+NumberException: 'zero' may not appear with other numbers
+
+>>> text2num("one zero")
+Traceback (most recent call last):
+...
+NumberException: 'zero' may not appear with other numbers
+
+>>> text2num("zero thousand")
+Traceback (most recent call last):
+...
+NumberException: 'zero' may not appear with other numbers
+
+>>> text2num("foo thousand")
+Traceback (most recent call last):
+...
+NumberException: unknown number: 'foo'
+
+
+Strings may optionally include the word 'and', but only in positions
+that make sense:
+
+>>> text2num("one thousand and two")
+1002
+
+>>> text2num("ten hundred and twelve")
+1012
+
+>>> text2num("nineteen hundred and eighty eight")
+1988
+
+>>> text2num("one hundred and ten thousand and one")
+110001
+
+>>> text2num("forty and two")
+Traceback (most recent call last):
+...
+NumberException: 'and' must be preceeded by a magnitude but got 'forty'
+
+>>> text2num("one and")
+Traceback (most recent call last):
+...
+NumberException: 'and' must be preceeded by a magnitude but got 'one'
+
+>>> text2num("and one")
+Traceback (most recent call last):
+...
+NumberException: 'and' must be preceeded by a magnitude
+
+>>> text2num("one hundred and")
+Traceback (most recent call last):
+...
+NumberException: 'and' must be followed by a number
+
+>>> text2num("nineteen and eighty eight")
+Traceback (most recent call last):
+...
+NumberException: 'and' must be preceeded by a magnitude but got 'nineteen'
+
+"""

 import re

-Small = {
+SMALL = {
     'zero': 0,
     'one': 1,
     'two': 2,
@@ -56,7 +298,8 @@
     'ninety': 90
 }

-Magnitude = {
+MAGNITUDE = {
+    'hundred': 100,
     'thousand': 1000,
     'million': 1000000,
     'billion': 1000000000,
@@ -70,37 +313,106 @@
     'decillion': 1000000000000000000000000000000000,
 }

+
 class NumberException(Exception):
-    def __init__(self, msg):
-        Exception.__init__(self, msg)
+    """
+    Number parsing error.
+
+    """
+    pass

+
 def text2num(s):
-    a = re.split(r"[\s-]+", s)
+    """
+    Convert the English number phrase `s` into the integer it describes.
+
+    """
+    # pylint: disable=invalid-name,too-many-branches,undefined-loop-variable
+    words = re.split(r'[\s,-]+', s)
+
+    if not words:
+        raise NumberException("no numbers in string: {!r}".format(s))
+
     n = 0
     g = 0
-    for w in a:
-        x = Small.get(w, None)
+    implied_hundred = False
+
+    for i, word in enumerate(words):
+        tens = g % 100
+        if word == "and":
+            if i and tens == 0:
+                # If this isn't the first word, and `g` was multiplied by 100
+                # or reset to 0, then we're in a spot where 'and' is allowed.
+                continue
+            else:
+                fmt = (word, " but got {!r}".format(words[i - 1]) if i else "")
+                raise NumberException("{!r} must be preceeded by a magnitude"
+                                      "{}".format(*fmt))
+
+        x = SMALL.get(word, None)
         if x is not None:
+            if x == 0 and len(words) > 1:
+                raise NumberException("{!r} may not appear with other "
+                                      "numbers".format(word))
+
+            if tens != 0:
+                # Check whether the two small numbers can be treated as if an
+                # implied 'hundred' is present, as in 'nineteen eighty four'.
+                if x >= 10:
+                    # Only allow implied hundreds if no other magnitude is
+                    # already present.
+                    if n == 0:
+                        n += g * 100
+                        g = 0
+                        implied_hundred = True
+                    else:
+                        fmt = (word, words[i - 1], " ".join(words[:i - 1]))
+                        raise NumberException("{!r} may not proceed {!r} "
+                                              "following {!r}".format(*fmt))
+                # Treat sequences like 'nineteen one' as errors rather than
+                # interpret them as 'nineteen hundred one', 'nineteen aught
+                # one', 'nineteen oh one', etc. But continue if we have 20 or
+                # greater in the accumulator to support 'twenty one', 'twenty
+                # two', etc.
+                elif tens < 20:
+                    raise NumberException("{!r} may not proceed "
+                                          "{!r}".format(word, words[i - 1]))
+
             g += x
-        elif w == "hundred" and g != 0:
-            g *= 100
         else:
-            x = Magnitude.get(w, None)
-            if x is not None:
+            x = MAGNITUDE.get(word, None)
+            if x is None:
+                raise NumberException("unknown number: {!r}".format(word))
+            # We could check some of these branches in the conditional one
+            # level up, but would prefer the 'unknown number' exception take
+            # precedence since it's a bigger problem.
+            elif implied_hundred:
+                fmt = (word, " ".join(words[:i]))
+                raise NumberException("{!r} may not proceed implied hundred "
+                                      "{!r}".format(*fmt))
+            # Disallow standalone magnitudes and multiple magnitudes like
+            # 'one thousand million' where 'one billion' should be used
+            # instead.
+            elif g == 0:
+                raise NumberException("magnitude {!r} must be preceded by a "
+                                      "number".format(word))
+            # Check whether this magnitude was preceded by a lower one.
+            elif 0 < n <= x or g >= x:
+                fmt = (word, " ".join(words[:i]))
+                raise NumberException("magnitude {!r} appeared out of order "
+                                      "following {!r}".format(*fmt))
+            # Accumulate hundreds in `g`, not `n`, since hundreds can magnify
+            # other magnitudes.
+            elif x == 100:
+                g *= x
+            else:
                 n += g * x
                 g = 0
-            else:
-                raise NumberException("Unknown number: "+w)
+
+    # We could check whether the last word is 'and' at the very beginning and
+    # fail early, but this way errors are raised in the order each word is
+    # seen, as if we're processing a stream.
+    if word == "and":
+        raise NumberException("{!r} must be followed by a number".format(word))
+
     return n + g
-
-if __name__ == "__main__":
-    assert 1 == text2num("one")
-    assert 12 == text2num("twelve")
-    assert 72 == text2num("seventy two")
-    assert 300 == text2num("three hundred")
-    assert 1200 == text2num("twelve hundred")
-    assert 12304 == text2num("twelve thousand three hundred four")
-    assert 6000000 == text2num("six million")
-    assert 6400005 == text2num("six million four hundred thousand five")
-    assert 123456789012 == text2num("one hundred twenty three billion four hundred fifty six million seven hundred eighty nine thousand twelve")
-    assert 4000000000000000000000000000000000 == text2num("four decillion")
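
A note on the splitting step in `text2num`: `re.split` always returns at least one element, so for an empty string the result is `['']` rather than `[]`, and the `if not words` guard in the diff would not fire. The sketch below is one possible way to handle that, written in the same doctest style the change adopts. The `tokenize` helper is hypothetical and is not part of this pull request; filtering out empty strings is an assumption about the intended behaviour, not something the diff does.

```python
import doctest
import re


def tokenize(s):
    """
    Split a number phrase into words (hypothetical helper, not in the PR).

    >>> tokenize("twenty-one")
    ['twenty', 'one']

    >>> tokenize("one thousand, two hundred")
    ['one', 'thousand', 'two', 'hundred']

    >>> tokenize("")
    []

    >>> tokenize("  one ")
    ['one']
    """
    # Same separator pattern as the diff: whitespace, commas and hyphens.
    # Leading or trailing separators (and the empty string) make re.split
    # return empty strings, so those are dropped here. With that filter in
    # place, an `if not words` check on the result can actually trigger.
    return [w for w in re.split(r'[\s,-]+', s) if w]


if __name__ == "__main__":
    # Run the examples in the docstrings above; silent when they all pass.
    doctest.testmod()
```

With this filtering, an empty or whitespace-only input yields an empty word list, so a caller could report "no numbers in string" instead of treating `''` as an unknown number.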