Add NGramsTransformer, fix #52 #56
Conversation
@nevillelyh see my comments
def apply(name: String, separator: String, low: Int, high: Int): TfType =
  new NGramsTransformer(name, separator, low, high)

def apply(name: String, high: Int): TfType = apply(name, "", 1, high)
Are these reasonable defaults to have, and are they even worth having?
super.buildFeatures(a.map(ngrams(_)), c, fb)

private[transformers] def ngrams(a: Seq[String]): Seq[String] = {
  val ngrams = for (i <- low to high) yield a.sliding(i).map(_.mkString(separator))
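For context, a self-contained sketch of the sliding-window generation used in `ngrams` above (the standalone object and parameter passing here are hypothetical; the actual class receives `low`, `high`, and `separator` via its constructor):

```scala
object NGramsSketch {
  // Produce all n-grams of sizes low..high from a token sequence,
  // joining each window's tokens with the given separator.
  def ngrams(tokens: Seq[String], low: Int, high: Int, separator: String): Seq[String] =
    (low to high).flatMap(n => tokens.sliding(n).map(_.mkString(separator)))

  def main(args: Array[String]): Unit =
    // ngrams(Seq("a", "b", "c"), 2, 3, " ") yields "a b", "b c", "a b c"
    println(ngrams(Seq("a", "b", "c"), 2, 3, " "))
}
```

One subtlety worth noting: Scala's `sliding(n)` emits a final partial window when `n` exceeds the sequence length (e.g. `Seq("a").sliding(2)` yields `Seq("a")`), so short rows still produce their shorter grams.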
Does it make sense to unroll this and try to optimize? Vocabulary space is obviously a problem for time complexity, which we can't solve here, but we could at least bring down the constant factor?
Not necessarily, but maybe worth making it lazy with `a.toStream` etc.
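A minimal sketch of the lazy variant being suggested, assuming the Scala-2.12-era `Stream` (renamed `LazyList` in 2.13); names are hypothetical:

```scala
// Lazily generate n-grams: nothing is materialized until the Stream is
// consumed, which can help when only a prefix of the n-grams is needed.
def ngramsLazy(tokens: Seq[String], low: Int, high: Int, separator: String): Stream[String] =
  (low to high).toStream.flatMap(n => tokens.toStream.sliding(n).map(_.mkString(separator)))
```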
property("default") = Prop.forAll { xs: List[List[String]] =>
  val transformer = new NGramsTransformer("n_gram", " ", 2, 4)
  val ngrams = xs.map(transformer.ngrams(_))
I don't really like this test because it doesn't exercise the `ngrams` function independently, which seems harder to do in a property test. Ideally I'd have a separate unit test to check that in isolation. What do you think?
It's OK for now. We can make assertions more black box as part of #53.
/**
 * Transform a collection of sentences, where each row is a `Seq[String]` of the words / tokens,
 * into a collection containing all the n-grams that can be constructed from each row. The feature
 * representation is an m-hot encoding (see [[NHotEncoder]]) constructed from an expanded vocabulary
You mean "n-hot"?
 * of all of the generated n-grams.
 *
 * N-grams are generated based on a specified range of `low` to `high` (inclusive) and are joined by
 * the given `separator` (the default is ""). For example, an [[NGramsTransformer]] with
I'd use space " " as the default, since you might run into duplicates with simple concatenation in a typical NLP application.
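A quick sketch of the duplicate problem being described: with an empty separator, distinct token sequences can collapse to the same n-gram string.

```scala
// With separator "", different bigrams collide into one vocabulary entry:
val x = Seq("ab", "c").sliding(2).map(_.mkString("")).toList // List("abc")
val y = Seq("a", "bc").sliding(2).map(_.mkString("")).toList // List("abc")
// x == y, so the two bigrams become indistinguishable.
// With separator " " they stay distinct: "ab c" vs. "a bc".
```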
 *
 * As with [[NHotEncoder]], missing values are transformed to [0.0, 0.0, ...].
 */
object NGramsTransformer {
No other transformer actually has `Transformer` in the name; just call it `NGrams`?
 * @param low the smallest size of the generated *-grams
 * @param high the largest size of the generated *-grams
 */
def apply(name: String, separator: String, low: Int, high: Int): TfType =
`def apply(name: String, low: Int = 1, high: Int = -1, sep: String = " ")`? This way one `apply` handles all. Just need to make it clear that by default it generates n-grams all the way to n = xs.length and may be expensive.
Also not worth aliasing `Transformer[A, B, C]` as `TfType` if you have only one `apply`. Use the full type to make it clearer?
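A sketch of how the proposed `-1` sentinel could resolve `high` per row (parameter names and defaults follow the suggestion above and are not the final API):

```scala
// high = -1 means "generate n-grams all the way up to n = tokens.length".
def ngrams(tokens: Seq[String], low: Int = 1, high: Int = -1, sep: String = " "): Seq[String] = {
  val hi = if (high == -1) tokens.length else high
  (low to hi).flatMap(n => tokens.sliding(n).map(_.mkString(sep)))
}

// ngrams(Seq("a", "b")) yields "a", "b", "a b"
```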
Force-pushed 08b1da9 to dd02d81
@nevillelyh addressed all your comments, feel free to take another look and lmk if good to merge
Codecov Report

@@           Coverage Diff           @@
##           master     #56    +/-  ##
=======================================
  Coverage     100%    100%
=======================================
  Files          36      37      +1
  Lines         946     970     +24
  Branches       91      81     -10
=======================================
+ Hits          946     970     +24

Continue to review full report at Codecov.
Tweak ngram params. Remove type alias. Use stream to compute ngrams lazily
Force-pushed dd02d81 to 6ec7fc4
@nevillelyh PTAL. Gonna leave some comments / questions in a few places.