Add NGramsTransformer, fix #52 #56
Conversation
@nevillelyh see my comments
def apply(name: String, separator: String, low: Int, high: Int): TfType =
  new NGramsTransformer(name, separator, low, high)

def apply(name: String, high: Int): TfType = apply(name, "", 1, high)
Are these reasonable defaults to have, and are they even worth having?
super.buildFeatures(a.map(ngrams(_)), c, fb)

private[transformers] def ngrams(a: Seq[String]): Seq[String] = {
  val ngrams = for (i <- low to high) yield a.sliding(i).map(_.mkString(separator))
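For context, a self-contained sketch of the sliding-window generation used in `ngrams` above (the standalone object and parameter passing here are hypothetical; the actual class receives `low`, `high`, and `separator` via its constructor):

```scala
object NGramsSketch {
  // Produce all n-grams of sizes low..high from a token sequence,
  // joining each window's tokens with the given separator.
  def ngrams(tokens: Seq[String], low: Int, high: Int, separator: String): Seq[String] =
    (low to high).flatMap(n => tokens.sliding(n).map(_.mkString(separator)))

  def main(args: Array[String]): Unit =
    // ngrams(Seq("a", "b", "c"), 2, 3, " ") yields "a b", "b c", "a b c"
    println(ngrams(Seq("a", "b", "c"), 2, 3, " "))
}
```

One subtlety worth noting: Scala's `sliding(n)` emits a final partial window when `n` exceeds the sequence length (e.g. `Seq("a").sliding(2)` yields `Seq("a")`), so short rows still produce their shorter grams.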
Does it make sense to unroll this and try to optimize? Vocabulary space is obviously a problem for time complexity, which we can't solve here, but we could at least bring down the constant factor?
Not necessarily, but maybe worth making it lazy with `a.toStream` etc.
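A minimal sketch of the lazy variant being suggested, assuming the Scala-2.12-era `Stream` (renamed `LazyList` in 2.13); names are hypothetical:

```scala
// Lazily generate n-grams: nothing is materialized until the Stream is
// consumed, which can help when only a prefix of the n-grams is needed.
def ngramsLazy(tokens: Seq[String], low: Int, high: Int, separator: String): Stream[String] =
  (low to high).toStream.flatMap(n => tokens.toStream.sliding(n).map(_.mkString(separator)))
```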
property("default") = Prop.forAll { xs: List[List[String]] =>
  val transformer = new NGramsTransformer("n_gram", " ", 2, 4)
  val ngrams = xs.map(transformer.ngrams(_))
I don't really like this test because it doesn't exercise the `ngrams` function independently, which seems harder to do in a property test. Ideally I'd have a separate unit test to check that in isolation. What do you think?
It's OK for now. We can make assertions more black box as part of #53.
/**
 * Transform a collection of sentences, where each row is a `Seq[String]` of the words / tokens,
 * into a collection containing all the n-grams that can be constructed from each row. The feature
 * representation is an m-hot encoding (see [[NHotEncoder]]) constructed from an expanded vocabulary
You mean "n-hot"?
 * of all of the generated n-grams.
 *
 * N-grams are generated based on a specified range of `low` to `high` (inclusive) and are joined by
 * the given `separator` (the default is ""). For example, an [[NGramsTransformer]] with
I'd use space " " as the default, since you might run into duplicates with simple concatenation in a typical NLP application.
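A quick sketch of the duplicate problem being described: with an empty separator, distinct token sequences can collapse to the same n-gram string.

```scala
// With separator "", different bigrams collide into one vocabulary entry:
val x = Seq("ab", "c").sliding(2).map(_.mkString("")).toList // List("abc")
val y = Seq("a", "bc").sliding(2).map(_.mkString("")).toList // List("abc")
// x == y, so the two bigrams become indistinguishable.
// With separator " " they stay distinct: "ab c" vs. "a bc".
```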
 *
 * As with [[NHotEncoder]], missing values are transformed to [0.0, 0.0, ...].
 */
object NGramsTransformer {
No other transformer actually has `Transformer` in the name; just call it `NGrams`?
 * @param low the smallest size of the generated *-grams
 * @param high the largest size of the generated *-grams
 */
def apply(name: String, separator: String, low: Int, high: Int): TfType =
`def apply(name: String, low: Int = 1, high: Int = -1, sep: String = " ")`? This way one `apply` handles all. Just need to make it clear that by default it generates n-grams all the way to n = xs.length and may be expensive.
Also not worth aliasing `Transformer[A, B, C]` as `TfType` if you have only one `apply`. Use the full type to make it clearer?
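A sketch of how the proposed `-1` sentinel could resolve `high` per row (parameter names and defaults follow the suggestion above and are not the final API):

```scala
// high = -1 means "generate n-grams all the way up to n = tokens.length".
def ngrams(tokens: Seq[String], low: Int = 1, high: Int = -1, sep: String = " "): Seq[String] = {
  val hi = if (high == -1) tokens.length else high
  (low to hi).flatMap(n => tokens.sliding(n).map(_.mkString(sep)))
}

// ngrams(Seq("a", "b")) yields "a", "b", "a b"
```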
Force-pushed 08b1da9 to dd02d81
@nevillelyh addressed all your comments, feel free to take another look and lmk if good to merge
Codecov Report

@@           Coverage Diff           @@
##           master     #56    +/-  ##
=======================================
  Coverage     100%    100%
=======================================
  Files          36      37      +1
  Lines         946     970     +24
  Branches       91      81     -10
=======================================
+ Hits          946     970     +24

Continue to review full report at Codecov.
Tweak ngram params. Remove type alias. Use stream to compute ngrams lazily
Force-pushed dd02d81 to 6ec7fc4
@nevillelyh PTAL. Gonna leave some comments / questions in a few places.