featran

Featran, also known as Featran77 or F77 (get it?), is a Scala library for feature transformation. It aims to simplify the time-consuming task of feature engineering in data science and machine learning processes. It supports various collection types for feature extraction and various output formats for feature representation.

Introduction

Most feature transformation logic requires two steps: a global aggregation to summarize the data, followed by an element-wise mapping to transform each record. For example:

Min-Max Scaler
- Aggregation: global min & max
- Mapping: rescale each value into a target range using the global min & max ([0, 1] in the code below)
One-Hot Encoder
- Aggregation: distinct labels
- Mapping: convert each label to a binary vector

We can implement this in a naive way using reduce and map.

case class Point(score: Double, label: String)
val data = Seq(Point(1.0, "a"), Point(2.0, "b"), Point(3.0, "c"))

// Aggregation: a single pass yielding (global min, global max, distinct labels)
val a = data
  .map(p => (p.score, p.score, Set(p.label)))
  .reduce((x, y) => (math.min(x._1, y._1), math.max(x._2, y._2), x._3 ++ y._3))

// Mapping: min-max scaled score followed by the one-hot encoded label
val features = data.map { p =>
  (p.score - a._1) / (a._2 - a._1) :: a._3.toList.sorted.map(s => if (s == p.label) 1.0 else 0.0)
}

But this quickly becomes unmanageable for complex feature sets. The same logic can be expressed concisely in Featran.

import com.spotify.featran._
import com.spotify.featran.transformers._

// Declare which field of Point feeds which transformer
val fs = FeatureSpec.of[Point]
  .required(_.score)(MinMaxScaler("min-max"))
  .required(_.label)(OneHotEncoder("one-hot"))

// Run the aggregation and the mapping over the input collection
val fe = fs.extract(data)
val names = fe.featureNames
val features = fe.featureValues[Seq[Double]]
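
To illustrate why a declarative specification scales better than hand-written tuples, the aggregate-then-map pattern can be captured in a small abstraction. The following is a minimal sketch in plain Scala with no Featran dependency. `SimpleTransformer`, `MinMax`, `OneHot` and `run` are hypothetical names chosen for illustration; they are not Featran's API and are not meant to reflect how the library is implemented.

```scala
object AggregateThenMapSketch {
  // Hypothetical abstraction: a transformer pairs a global aggregation
  // (prepare + combine) with an element-wise mapping (transform).
  trait SimpleTransformer[A, S] {
    def prepare(a: A): S                    // lift one value into a summary
    def combine(x: S, y: S): S              // merge two summaries
    def transform(a: A, s: S): Seq[Double]  // map one value using the global summary
  }

  object MinMax extends SimpleTransformer[Double, (Double, Double)] {
    def prepare(a: Double): (Double, Double) = (a, a)
    def combine(x: (Double, Double), y: (Double, Double)): (Double, Double) =
      (math.min(x._1, y._1), math.max(x._2, y._2))
    def transform(a: Double, s: (Double, Double)): Seq[Double] =
      Seq((a - s._1) / (s._2 - s._1))
  }

  object OneHot extends SimpleTransformer[String, Set[String]] {
    def prepare(a: String): Set[String] = Set(a)
    def combine(x: Set[String], y: Set[String]): Set[String] = x ++ y
    def transform(a: String, s: Set[String]): Seq[Double] =
      s.toSeq.sorted.map(l => if (l == a) 1.0 else 0.0)
  }

  // Apply one transformer to a collection: reduce to a summary, then map.
  def run[T, A, S](data: Seq[T], t: SimpleTransformer[A, S])(get: T => A): Seq[Seq[Double]] = {
    val summary = data.map(d => t.prepare(get(d))).reduce(t.combine)
    data.map(d => t.transform(get(d), summary))
  }

  case class Point(score: Double, label: String)
  val data = Seq(Point(1.0, "a"), Point(2.0, "b"), Point(3.0, "c"))

  val scaled = run(data, MinMax)(_.score)
  val encoded = run(data, OneHot)(_.label)

  // Concatenate the per-transformer outputs into one feature vector per record.
  val features = scaled.zip(encoded).map { case (x, y) => x ++ y }
}
```

Each transformer here is self-contained, so adding a feature means adding one transformer rather than widening a tuple. This sketch makes one pass per transformer for simplicity.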

Featran also supports these additional features.

Extract from Scala collections, Flink DataSets, Scalding TypedPipes, Scio SCollections and Spark RDDs
Output as Scala collections, Breeze dense and sparse vectors, TensorFlow Example Protobuf, XGBoost LabeledPoint and NumPy .npy file
Import aggregation from a previous extraction for training, validation and test sets
Compose feature specifications and separate outputs
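
The idea behind importing an aggregation from a previous extraction can be sketched in plain Scala: the summary is computed once on the training set and then reused, unchanged, to transform validation or test data. The `Summary` type and the `summarize` and `transform` functions below are hypothetical helpers for illustration only and do not use Featran's API.

```scala
object ReuseAggregationSketch {
  case class Point(score: Double, label: String)

  // Hypothetical container for the aggregation results.
  case class Summary(min: Double, max: Double, labels: List[String])

  // Step 1: aggregate once, typically over the training set.
  def summarize(data: Seq[Point]): Summary =
    Summary(
      data.map(_.score).min,
      data.map(_.score).max,
      data.map(_.label).distinct.sorted.toList
    )

  // Step 2: element-wise mapping using a summary that may come from another data set.
  def transform(data: Seq[Point], s: Summary): Seq[List[Double]] =
    data.map { p =>
      (p.score - s.min) / (s.max - s.min) ::
        s.labels.map(l => if (l == p.label) 1.0 else 0.0)
    }

  val train = Seq(Point(1.0, "a"), Point(2.0, "b"), Point(3.0, "c"))
  val test = Seq(Point(2.5, "b"), Point(4.0, "d"))

  val summary = summarize(train)

  val trainFeatures = transform(train, summary)
  // The test set is mapped with the training summary, so both sets share
  // the same scale and the same one-hot column order.
  val testFeatures = transform(test, summary)
}
```

In this sketch a score outside the training range maps outside [0, 1], and a label unseen during training encodes as all zeros. A real library may handle these cases differently.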

See Examples (source) for detailed examples. See transformers package for a complete list of available feature transformers.

See ScalaDocs for current API documentation.

Presentations

Featran - Type safe and generic feature transformation in Scala - NABD Conf Palo Alto 2017 talk

Artifacts

Featran includes the following artifacts:

featran-core - core library, support for extraction from Scala collections and output as Scala collections, Breeze dense and sparse vectors
featran-java - Java interface, see JavaExample.java
featran-flink - support for extraction from Flink DataSet
featran-scalding - support for extraction from Scalding TypedPipe
featran-scio - support for extraction from Scio SCollection
featran-spark - support for extraction from Spark RDD
featran-tensorflow - support for output as TensorFlow Example Protobuf
featran-xgboost - support for output as XGBoost LabeledPoint
featran-numpy - support for output as NumPy .npy file
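
As a rough sketch, the artifacts might be added to an sbt build as follows. The `com.spotify` group ID is an assumption inferred from the `com.spotify.featran` package name, and `featranVersion` is a placeholder; check the project's releases for the actual coordinates and version.

```scala
// build.sbt fragment (sketch): the group ID and version are assumptions, see above
val featranVersion = "x.y.z" // placeholder, replace with a published release

libraryDependencies ++= Seq(
  "com.spotify" %% "featran-core" % featranVersion,
  // Optional modules, added only for the backends and output formats in use
  "com.spotify" %% "featran-tensorflow" % featranVersion
)
```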

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0
