
Add NGramsTransformer, fix #52 #56

Merged
merged 1 commit into master from andrew/ngrams on Oct 17, 2017

Conversation

andrewsmartin (Contributor):

@nevillelyh PTAL. Gonna leave some comments / questions in a few places.

andrewsmartin (Contributor, Author) left a comment:

@nevillelyh see my comments

def apply(name: String, separator: String, low: Int, high: Int): TfType =
new NGramsTransformer(name, separator, low, high)

def apply(name: String, high: Int): TfType = apply(name, "", 1, high)
andrewsmartin (Contributor, Author):

Are these reasonable defaults to have, and are they even worth having?

super.buildFeatures(a.map(ngrams(_)), c, fb)

private[transformers] def ngrams(a: Seq[String]): Seq[String] = {
val ngrams = for (i <- low to high) yield a.sliding(i).map(_.mkString(separator))
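The sliding-window logic in the snippet above can be sketched as a standalone function (the name, parameter list, and the cap at `tokens.length` here are illustrative, not featran's actual API):

```scala
// For each window size i in [low, high], slide a window of i tokens
// over the sequence and join each window with the separator.
// Capping at tokens.length avoids partial windows from sliding(i)
// when i exceeds the input length.
def ngrams(tokens: Seq[String], low: Int, high: Int, sep: String): Seq[String] =
  (low to math.min(high, tokens.length)).flatMap { i =>
    tokens.sliding(i).map(_.mkString(sep))
  }

// ngrams(Seq("a", "b", "c"), 1, 2, " ")
// yields Seq("a", "b", "c", "a b", "b c")
```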
andrewsmartin (Contributor, Author) Oct 17, 2017:

Does it make sense to unroll this and try to optimize? Vocabulary size is obviously a problem for time complexity, which we can't solve here, but we could at least bring down the constant factor?

Contributor:

Not necessarily but maybe worth making it lazy with a.toStream etc.
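The laziness suggestion might look roughly like this (a sketch only, using `LazyList`, the Scala 2.13+ replacement for the deprecated `Stream`/`toStream` mentioned above):

```scala
// Build the n-grams over a LazyList so that windows are only
// materialized as the result is actually consumed.
def ngramsLazy(tokens: Seq[String], low: Int, high: Int, sep: String): LazyList[String] =
  LazyList.range(low, math.min(high, tokens.length) + 1).flatMap { i =>
    tokens.sliding(i).map(_.mkString(sep))
  }

// Forcing the LazyList yields the same elements as the eager version:
// ngramsLazy(Seq("a", "b"), 1, 2, "-").toList == List("a", "b", "a-b")
```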


property("default") = Prop.forAll { xs: List[List[String]] =>
val transformer = new NGramsTransformer("n_gram", " ", 2, 4)
val ngrams = xs.map(transformer.ngrams(_))
andrewsmartin (Contributor, Author):

I don't really like this test because it doesn't exercise the ngrams function independently, which seems harder to do in a property test. Ideally I'd have a separate unit test to check that in isolation. What do you think?
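A separate example-based check of the n-gram logic in isolation could be as small as this (the helper below is a hypothetical stand-in for the transformer's private `ngrams` method, checked against a fixed input rather than a generated one):

```scala
// Hypothetical stand-in mirroring the sliding-window n-gram logic,
// verified on a small hand-picked input.
def mkNgrams(tokens: Seq[String], low: Int, high: Int, sep: String): Seq[String] =
  (low to math.min(high, tokens.length)).flatMap(i => tokens.sliding(i).map(_.mkString(sep)))

val result = mkNgrams(Seq("the", "quick", "fox"), 2, 3, " ")
// result == Seq("the quick", "quick fox", "the quick fox")
```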

Contributor:

It's OK for now. We can make assertions more black box as part of #53.

/**
* Transform a collection of sentences, where each row is a `Seq[String]` of the words / tokens,
* into a collection containing all the n-grams that can be constructed from each row. The feature
* representation is an m-hot encoding (see [[NHotEncoder]]) constructed from an expanded vocabulary
Contributor:

You mean "n-hot"?

* of all of the generated n-grams.
*
* N-grams are generated based on a specified range of `low` to `high` (inclusive) and are joined by
* the given `separator` (the default is ""). For example, an [[NGramsTransformer]] with
Contributor:

I'd use space " " as default, since you might run into duplicates with simple concat in typical NLP application.
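The duplicates the empty-string default can produce are easy to demonstrate in plain Scala, independent of featran:

```scala
// With an empty separator, distinct bigrams collide on concatenation:
val concat = Seq("ab", "c").sliding(2).map(_.mkString("")).toList // List("abc")
val other  = Seq("a", "bc").sliding(2).map(_.mkString("")).toList // List("abc")
// Both collapse to "abc", conflating two different token pairs
// in the generated vocabulary.

// A space separator keeps them distinct:
val spaced = Seq("ab", "c").sliding(2).map(_.mkString(" ")).toList // List("ab c")
```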

*
* As with [[NHotEncoder]], missing values are transformed to [0.0, 0.0, ...].
*/
object NGramsTransformer {
Contributor:

No other transformer actually has Transformer in the name; just call it NGrams?

* @param low the smallest size of the generated *-grams
* @param high the largest size of the generated *-grams
*/
def apply(name: String, separator: String, low: Int, high: Int): TfType =
nevillelyh (Contributor) Oct 17, 2017:

def apply(name: String, low: Int = 1, high: Int = -1, sep: String = " ")? This way one apply handles all cases. Just need to make it clear that by default it generates n-grams all the way to n = xs.length, which may be expensive.

Also not worth aliasing Transformer[A, B, C] as TfType if you have only one apply. Use full type to make it clearer?
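A sketch of what that single apply's body might compute (the name `ngramsAll` and the handling of the -1 sentinel are assumptions for illustration, not the merged code):

```scala
// high = -1 is interpreted as "generate n-grams all the way to
// n = tokens.length", which can be expensive for long inputs.
def ngramsAll(tokens: Seq[String],
              low: Int = 1,
              high: Int = -1,
              sep: String = " "): Seq[String] = {
  val hi = if (high == -1) tokens.length else math.min(high, tokens.length)
  (low to hi).flatMap(i => tokens.sliding(i).map(_.mkString(sep)))
}

// With the defaults: ngramsAll(Seq("a", "b")) == Seq("a", "b", "a b")
```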

andrewsmartin (Contributor, Author):

@nevillelyh addressed all your comments, feel free to take another look and lmk if good to merge

codecov-io commented Oct 17, 2017

Codecov Report

Merging #56 into master will not change coverage.
The diff coverage is 100%.


@@          Coverage Diff          @@
##           master    #56   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files          36     37    +1     
  Lines         946    970   +24     
  Branches       91     81   -10     
=====================================
+ Hits          946    970   +24
Impacted Files Coverage Δ
...cala/com/spotify/featran/transformers/NGrams.scala 100% <100%> (ø)
...ify/featran/transformers/PolynomialExpansion.scala 100% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bb87075...6ec7fc4. Read the comment docs.

Tweak ngram params.

Remove type alias.

Use stream to compute ngrams lazily
@nevillelyh nevillelyh merged commit 6c00292 into master Oct 17, 2017
@nevillelyh nevillelyh deleted the andrew/ngrams branch October 17, 2017 20:11