Custom regex for words.txt
Asher uses the NLP system called compromise. It provides a neat way to look up and grab words in a text based on their parsed, interpreted representations, as opposed to just their characters.
For ease of use, it superficially resembles regex.
Results are an array of Terms objects, which lets you manipulate individual matches or operate on them in bulk. Transformations to matches apply to the original terms themselves, so you can efficiently inspect, transform, then return your parsed text.
Term-to-term matches look up both the normalised and the non-normalised text directly:
let matches = nlp('John eats glue!').match('john eats glue').out('text')
//"John eats glue"
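To make the idea of a normalised lookup concrete, here is a minimal sketch in plain JavaScript. It is a toy, not compromise's actual normaliser: the helper names `normalizeTerm` and `matchLiteral` are hypothetical, and the normalisation rule (lower-case, strip leading and trailing punctuation) is an assumption chosen only to reproduce the example above.

```javascript
// Toy illustration only: NOT compromise's real normaliser.
// Lower-cases a term and strips leading/trailing punctuation.
function normalizeTerm(word) {
  return word.toLowerCase().replace(/^[^a-z0-9]+|[^a-z0-9]+$/g, '');
}

// Returns the matched slice of the original words, or null if nothing matches.
function matchLiteral(text, pattern) {
  const words = text.split(/\s+/).filter(Boolean);
  const wanted = pattern.split(/\s+/).filter(Boolean).map(normalizeTerm);
  for (let start = 0; start + wanted.length <= words.length; start++) {
    // Compare normalised forms, but keep the original words for the result.
    const hit = wanted.every((w, i) => normalizeTerm(words[start + i]) === w);
    if (hit) return words.slice(start, start + wanted.length);
  }
  return null;
}

console.log(matchLiteral('John eats glue!', 'john eats glue'));
// [ 'John', 'eats', 'glue!' ]
```

The comparison runs on the normalised form while the returned terms keep their original casing, which is why a lower-case pattern can still hand back "John".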
You can loosen a search by matching any part-of-speech tag, which lets you find all the things John eats, for example:
let matches = nlp('John eats glue').match('john eats #Noun').out('text')
//"John eats glue"
The tags can also be optional (?) or greedy (+):
nlp('he is good').match('#Adverb? good').out('text')
//'good'
nlp('he is really, really good').match('#Adverb+ good').out('text')
//'really, really good'
The . character means 'any one term'.
let matches = nlp('John eats glue').match('john . glue').out('text')
//"John eats glue"
The * character means 'all terms until'. It may match zero terms.
let matches = nlp('John always ravenously eats his glue').match('john * eats').out('text')
//"John always ravenously eats"
The ? character at the end of a word means the word is optional.
let matches = nlp('John eats glue').match('john always? eats glue').out('text')
//"John eats glue"
let matches = nlp('John eats glue').match('john #Adverb? eats glue').out('text')
//"John eats glue"
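The three wildcard-style tokens described so far (., * and a trailing ?) can be sketched as a small backtracking matcher over words. This is a toy written for illustration, not compromise's implementation: `toyMatch`, `matchAt` and `norm` are hypothetical names, tags such as #Noun are not supported, and * is assumed to stop at the first point where the rest of the pattern fits.

```javascript
// Toy term-level matcher: a sketch, NOT compromise's implementation.
// Supports literal words, '.', '*', and a trailing '?' on a word.
const norm = (w) => w.toLowerCase().replace(/[^a-z0-9]/g, '');

// Tries to match tokens[j..] against words[i..].
// Returns the index just past the match, or -1 on failure.
function matchAt(words, i, tokens, j) {
  if (j === tokens.length) return i; // pattern exhausted
  const tok = tokens[j];
  if (tok === '*') {
    // Consume 0, 1, 2, ... terms until the rest of the pattern fits.
    for (let k = i; k <= words.length; k++) {
      const end = matchAt(words, k, tokens, j + 1);
      if (end !== -1) return end;
    }
    return -1;
  }
  if (tok === '.') {
    // Exactly one term, whatever it is.
    return i < words.length ? matchAt(words, i + 1, tokens, j + 1) : -1;
  }
  const optional = tok.endsWith('?');
  const word = optional ? tok.slice(0, -1) : tok;
  if (i < words.length && norm(words[i]) === norm(word)) {
    const end = matchAt(words, i + 1, tokens, j + 1);
    if (end !== -1) return end;
  }
  // An optional word may simply be skipped.
  return optional ? matchAt(words, i, tokens, j + 1) : -1;
}

function toyMatch(text, pattern) {
  const words = text.split(/\s+/).filter(Boolean);
  const tokens = pattern.split(/\s+/).filter(Boolean);
  for (let start = 0; start < words.length; start++) {
    const end = matchAt(words, start, tokens, 0);
    if (end !== -1) return words.slice(start, end).join(' ');
  }
  return '';
}

console.log(toyMatch('John eats glue', 'john . glue'));
// John eats glue
console.log(toyMatch('John always ravenously eats his glue', 'john * eats'));
// John always ravenously eats
console.log(toyMatch('John eats glue', 'john always? eats glue'));
// John eats glue
```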
The + character at the end of a tag (or .) means the match will continue across repeated consecutive matches:
nlp('john, david, and joe went fishing').match('#Person+ and joe').out('text')
//'john, david and joe'
(word1|word2) parentheses allow listing the possible matches for a word:
let matches = nlp('John eats glue').match('john (eats|sniffs|wears) .').out('text')
//"John eats glue"
You can run a JavaScript regular expression on every word in your document, if you wish, using the /myregex/ syntax.
nlp('it is raining and had rained').match('#Verb /rain(ing|ed)/').out('array')
Note that this will not match multiple-word patterns, and will be slower than other lookups, like (#Verb raining|#Verb rained), for example.
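Since the per-word pattern is an ordinary JavaScript regular expression, its behaviour can be checked without any NLP library. The sketch below contrasts square brackets (a character class, which matches exactly one character from the set) with parentheses (a group of alternatives). The ^ and $ anchors are added here only to make the difference visible; whether compromise anchors per-word regexes is not stated in this page. Without anchors, the bracket form would still find "raining" and "rained", because it matches the prefix "raini" or "raine".

```javascript
// Plain JavaScript regex, independent of any NLP library.
const charClass = /^rain[ing|ed]$/; // "rain" + ONE character from: i n g | e d
const group = /^rain(ing|ed)$/;     // "rain" + the whole suffix "ing" or "ed"

const samples = ['raining', 'rained', 'raini', 'rain|'];
for (const w of samples) {
  console.log(w, charClass.test(w), group.test(w));
}
// raining false true
// rained false true
// raini true false
// rain| true false
```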
You can find a match and return only a subset of it by putting [] brackets around any group. Using this pattern you can effectively do 'look-arounds', adding conditions to a match statement.
nlp('i saw ralf eat the glue').match('#Person [#Verb the #Noun]').out('normal')
//"eat the glue"
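For readers who know plain regex, the closest analogues are a capture group or a lookbehind: some context is required for the match, but only part of it is returned. The following is an analogy in ordinary JavaScript regex over raw characters, not a description of how compromise evaluates its brackets.

```javascript
const sentence = 'i saw ralf eat the glue';

// Capture group: "ralf " must be present, but only the parenthesised part is kept.
const viaGroup = sentence.match(/ralf (eat the \w+)/);
console.log(viaGroup ? viaGroup[1] : null);
// eat the glue

// Lookbehind: same condition, expressed as a zero-width assertion.
const viaLookbehind = sentence.match(/(?<=ralf )eat the \w+/);
console.log(viaLookbehind ? viaLookbehind[0] : null);
// eat the glue
```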
A leading ^ character means 'at the start of a sentence'.
let matches = nlp('John eats glue').match('^john eats ...').out('text')
//"John eats glue"
An ending $ character means 'must be at the end of the sentence'.
let matches = nlp('John eats glue').match('eats glue$').out('text')
//"eats glue"
You can specify a not-match with a ! character:
str = 'Homer Simpson and Homer Adkins'
nlp(str).match('homer !simpson').out()
//'Homer Adkins'
You can specify a minimum and maximum number of repeated terms, like this:
str = 'homer j j j j simpson'
nlp(str).match('homer #Acronym{2,6} simpson').out()
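The {min,max} notation reads the same way as a bounded quantifier in plain regex. As a rough character-level analogy (not compromise's term-level logic, and with a literal "j" standing in for the #Acronym tag), the bounds behave like this:

```javascript
// Plain-regex analogy: between 2 and 6 repeats of " j" are allowed.
const bounded = /^homer( j){2,6} simpson$/;

console.log(bounded.test('homer j j j j simpson'));       // true  (4 repeats)
console.log(bounded.test('homer j simpson'));             // false (1 repeat, below the minimum)
console.log(bounded.test('homer j j j j j j j simpson')); // false (7 repeats, above the maximum)
```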
You can look for sub-word matches using the _ character:
var r = nlp(`it's kind of a funny story`)
r.match('_nny') //funny
r.match('fu_') //funny
r.match('_nn_') //funny
r.match('_d story').found //false
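One way to read the _ patterns above is as prefix, suffix and substring tests on a single word. The helper below is a hypothetical sketch of that reading (`subwordTest` is not a compromise function) and handles only a leading and/or trailing underscore.

```javascript
// Toy sketch of the `_` sub-word idea; NOT compromise's implementation.
function subwordTest(pattern, word) {
  const open = pattern.startsWith('_'); // anything may precede the core
  const close = pattern.endsWith('_');  // anything may follow the core
  const core = pattern.slice(open ? 1 : 0, close ? -1 : undefined);
  if (open && close) return word.includes(core);
  if (open) return word.endsWith(core);
  if (close) return word.startsWith(core);
  return word === core;
}

const storyWords = "it's kind of a funny story".split(' ');
console.log(storyWords.filter((w) => subwordTest('_nny', w))); // [ 'funny' ]
console.log(storyWords.filter((w) => subwordTest('fu_', w)));  // [ 'funny' ]
console.log(storyWords.filter((w) => subwordTest('_nn_', w))); // [ 'funny' ]
// '_d story' fails in the original example because the word before "story"
// is "funny", which does not end in "d".
console.log(subwordTest('_d', 'funny')); // false
```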