# ml-featurizer

**Repository Path**: mirrors_adobe/ml-featurizer

## Basic Information

- **Project Name**: ml-featurizer
- **Description**: ML Featurizer is a library to enable users to create additional features from raw data with ease
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-09-24
- **Last Updated**: 2026-05-23

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# ML Featurizer
Feature engineering is a difficult and time consuming process. ML Featurizer is a library
to enable users to create additional features from raw data with ease. 
It extends and enriches the existing [Spark's feature engineering functionality](https://spark.apache.org/docs/latest/ml-features.html).

#### Featurizers provided by the library

  1. Unary Temporal Featurizers
      * DayOfWeekFeaturizer
      * HourOfDayFeaturizer
      * MonthOfYearFeaturizer
      * PartsOfDayFeaturizer
      * WeekendFeaturizer
  2. Unary Numeric Featurizers
      * LogTransformFeaturizer
      * MathFeaturizer
      * PowerTransformFeaturizer
  3. Binary Temporal Featurizers
      * DateDiffFeaturizer
  4. Binary Numeric Featurizers
      * AdditionFeaturizer
      * DivisionFeaturizer
      * MultiplicationFeaturizer
      * SubtractionFeaturizer
  5. Binary String Featurizers
      * ConcateColumnsFeaturizer
  6. Grouping Featurizers
      * GroupByFeaturizer (count, ratio, min, max, count, avg, sum)
  7. GEO Featurizers
      * GeohashFeaturizer (convert latitude and longitude into geohash)
      
#### Examples:
##### Create day of week feature
```scala
object DayOfWeekFeaturizerExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DayOfWeekFeaturizer").master("local").getOrCreate()

    val data = Array((0, "2018-01-02"),
      (1, "2018-02-02"),
      (2, "2018-03-02"),
      (3, "2018-04-05"),
      (3, "2018-05-05"))
    val dataFrame = spark.createDataFrame(data).toDF("id", "date")

    val featurizer = new DayOfWeekFeaturizer()
      .setInputCol("date")
      .setOutputCol("dayOfWeek")
      .setFormat("yyyy-MM-dd")

    val featurizedDataFrame = featurizer.transform(dataFrame)
    featurizedDataFrame.show()
  }
}

```      
##### Use featurizers in Spark ML Pipeline

```scala
object FeaturePipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FeaturePipeline").master("local").getOrCreate()

    val data = Array((0, "2018-01-02", 1.0, 2.0, "mercedes"),
      (1, "2018-02-02", 2.5, 3.5, "lexus"),
      (2, "2018-03-02", 5.0, 1.0, "toyota"),
      (3, "2018-04-05", 8.0, 9.0, "tesla"),
      (4, "2018-05-05", 1.0, 5.0, "bmw"),
      (4, "2018-05-05", 1.0, 5.0, "bmw"))
    val dataFrame = spark.createDataFrame(data).toDF("id", "date", "price1", "price2", "brand")

    val dayOfWeekfeaturizer = new DayOfWeekFeaturizer()
      .setInputCol("date")
      .setOutputCol("dayOfWeek")
      .setFormat("yyyy-MM-dd")

    val monthOfYearfeaturizer = new MonthOfYearFeaturizer()
      .setInputCol("date")
      .setOutputCol("monthOfYear")
      .setFormat("yyyy-MM-dd")

    val weekendFeaturizer = new WeekendFeaturizer()
      .setInputCol("date")
      .setOutputCol("isWeekend")
      .setFormat("yyyy-MM-dd")

    val additionFeaturizer = new AdditionFeaturizer()
      .setInputCols("price1", "price2")
      .setOutputCol("price1_add_price2")

    val indexer = new StringIndexer()
      .setInputCol("brand")
      .setOutputCol("brandIndex")

    val encoder = new OneHotEncoder()
      .setInputCol("brandIndex")
      .setOutputCol("brandVector")

    val pipeline = new Pipeline()
      .setStages(Array(dayOfWeekfeaturizer, monthOfYearfeaturizer, weekendFeaturizer, additionFeaturizer,
        indexer, encoder))
    val model = pipeline.fit(dataFrame)
    model.transform(dataFrame).show()
  }
}
```
#### References:
* [An Empirical Analysis of Feature Engineering for
 Predictive Modeling](https://arxiv.org/pdf/1701.07852.pdf)      

### Contributing

If you're interested in contributing to this project, check out our [contribution guidelines](CONTRIBUTING.md)!