Scaling the Features

Summary Statistics

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Drop the label in the last column and parse the remaining fields as doubles.
val featureVector = rdd.map(row => Vectors.dense(row.take(row.length - 1).map(str => str.toDouble)))
val stats = Statistics.colStats(featureVector)
println(s"Max : ${stats.max}, Min : ${stats.min}, and Mean : ${stats.mean} and Variance : ${stats.variance}")

Output

Max : [15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9], 
Min : [4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4], and
Mean : [8.319637273295804,0.5278205128205128,0.2709756097560975,2.5388055034396513,0.08746654158849285,15.874921826141337,46.4677923702314,0.9967466791744846,3.3111131957473425,0.6581488430268924,10.42298311444653] and 
Variance : [3.031416388997815,0.0320623776515516,0.03794748313440582,1.987897132985963,0.002215142653300991,109.41488383305895,1082.1023725325845,3.56202945332629E-6,0.02383518054541292,0.028732616129761978,1.135647395000472]
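colStats returns a MultivariateStatisticalSummary, so a few more per-column summaries are available beyond max, min, mean, and variance. A minimal sketch reusing the stats value computed above:

     // Additional fields of the MultivariateStatisticalSummary.
     println(s"Rows : ${stats.count}")            // number of feature vectors
     println(s"Non-zeros : ${stats.numNonzeros}") // per-column count of non-zero values
     println(s"L1 norm : ${stats.normL1}")        // per-column L1 norm
     println(s"L2 norm : ${stats.normL2}")        // per-column L2 norm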

Scaling

  1. Make each feature have approximately zero mean by replacing each value x with x - m, where m is the feature's mean, and unit standard deviation by dividing by the feature's standard deviation s, i.e. x -> (x - m) / s.

    Import

     import org.apache.spark.mllib.feature.StandardScaler
    

    Code

     val scaler = new StandardScaler(withMean = true, withStd = true).fit(trainingSet.map(dp => dp.features))
    

    Where

    1. StandardScaler is a constructor with two parameters:
       - withMean (default: false): Centers the data with the mean before scaling. It builds a dense output, so it does not work on sparse input and will raise an exception.
       - withStd (default: true): Scales the data to unit standard deviation.
    2. fit(data: RDD[Vector]): StandardScalerModel: Computes the mean and variance of the input and stores them in a model to be used for later scaling.
    3. trainingSet consists of LabeledPoint instances, so dp is a LabeledPoint; its features method returns the feature vector of the data point.
  2. Scale the training and test sets; a quick sanity check follows this list.
     val scaledTrainingSet = trainingSet.map(dp => new LabeledPoint(dp.label, scaler.transform(dp.features))).cache()
     val scaledTestSet = testSet.map(dp => new LabeledPoint(dp.label, scaler.transform(dp.features))).cache()
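To confirm the transformation did what the description above promises, you can recompute the column statistics on the scaled training set: with withMean = true and withStd = true, every feature's mean should be approximately 0 and its variance approximately 1. A minimal sketch reusing Statistics.colStats from the previous section:

     import org.apache.spark.mllib.stat.Statistics

     // Recompute the per-column statistics on the scaled features.
     val scaledStats = Statistics.colStats(scaledTrainingSet.map(dp => dp.features))
     println(s"Scaled mean : ${scaledStats.mean}")         // each entry should be ~0.0
     println(s"Scaled variance : ${scaledStats.variance}") // each entry should be ~1.0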