Scaling the Features

Summary Statistics

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Drop the label in the last column and parse the remaining fields as doubles.
val featureVector = rdd.map(row => Vectors.dense(row.take(row.length - 1).map(str => str.toDouble)))
val stats = Statistics.colStats(featureVector)
println(s"Max : ${stats.max}, Min : ${stats.min}, and Mean : ${stats.mean} and Variance : ${stats.variance}")

Output

Max : [15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9], 
Min : [4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4], and
Mean : [8.319637273295804,0.5278205128205128,0.2709756097560975,2.5388055034396513,0.08746654158849285,15.874921826141337,46.4677923702314,0.9967466791744846,3.3111131957473425,0.6581488430268924,10.42298311444653] and 
Variance : [3.031416388997815,0.0320623776515516,0.03794748313440582,1.987897132985963,0.002215142653300991,109.41488383305895,1082.1023725325845,3.56202945332629E-6,0.02383518054541292,0.028732616129761978,1.135647395000472]
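colStats returns a MultivariateStatisticalSummary, so a few more per-column summaries are available beyond max, min, mean, and variance. A minimal sketch reusing the stats value computed above:

     // Additional fields of the MultivariateStatisticalSummary.
     println(s"Rows : ${stats.count}")            // number of feature vectors
     println(s"Non-zeros : ${stats.numNonzeros}") // per-column count of non-zero values
     println(s"L1 norm : ${stats.normL1}")        // per-column L1 norm
     println(s"L2 norm : ${stats.normL2}")        // per-column L2 norm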

Scaling

  1. Make each feature have approximately zero mean by replacing each value x with x - m, where m is the feature's mean, and unit standard deviation by dividing by the feature's standard deviation s, i.e. x -> (x - m) / s.

    Import

     import org.apache.spark.mllib.feature.StandardScaler
    

    Code

     val scaler = new StandardScaler(withMean = true, withStd = true).fit(trainingSet.map(dp => dp.features))
    

    Where

    1. StandardScaler is a constructor with two parameters:
       - withMean (default: false): Centers the data with the mean before scaling. It builds a dense output, so it does not work on sparse input and will raise an exception.
       - withStd (default: true): Scales the data to unit standard deviation.
    2. fit(data: RDD[Vector]): StandardScalerModel: Computes the mean and variance of the input and stores them in a model to be used for later scaling.
    3. trainingSet consists of LabeledPoint instances, so dp is a LabeledPoint; its features method returns the feature vector of the data point.
  2. Scale the training and test sets; a quick sanity check follows this list.
     val scaledTrainingSet = trainingSet.map(dp => new LabeledPoint(dp.label, scaler.transform(dp.features))).cache()
     val scaledTestSet = testSet.map(dp => new LabeledPoint(dp.label, scaler.transform(dp.features))).cache()
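To confirm the transformation did what the description above promises, you can recompute the column statistics on the scaled training set: with withMean = true and withStd = true, every feature's mean should be approximately 0 and its variance approximately 1. A minimal sketch reusing Statistics.colStats from the previous section:

     import org.apache.spark.mllib.stat.Statistics

     // Recompute the per-column statistics on the scaled features.
     val scaledStats = Statistics.colStats(scaledTrainingSet.map(dp => dp.features))
     println(s"Scaled mean : ${scaledStats.mean}")         // each entry should be ~0.0
     println(s"Scaled variance : ${scaledStats.variance}") // each entry should be ~1.0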