Scaling the Features
Summary Statistics
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
// Drop the label (last column) and convert the remaining fields to a dense feature vector.
val featureVector = rdd.map(row => Vectors.dense(row.take(row.length-1).map(str => str.toDouble)))
val stats = Statistics.colStats(featureVector)
println(s"Max : ${stats.max}, Min : ${stats.min}, and Mean : ${stats.mean} and Variance : ${stats.variance}")
Output
Max : [15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9],
Min : [4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4], and
Mean : [8.319637273295804,0.5278205128205128,0.2709756097560975,2.5388055034396513,0.08746654158849285,15.874921826141337,46.4677923702314,0.9967466791744846,3.3111131957473425,0.6581488430268924,10.42298311444653] and
Variance : [3.031416388997815,0.0320623776515516,0.03794748313440582,1.987897132985963,0.002215142653300991,109.41488383305895,1082.1023725325845,3.56202945332629E-6,0.02383518054541292,0.028732616129761978,1.135647395000472]
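The statistics above are computed from an rdd of String arrays whose construction is not shown. A minimal loading sketch, assuming a semicolon-separated CSV (for example the UCI wine-quality file; the path and separator here are assumptions) with a header row and the label in the last column:
// Hypothetical loading step: file path and separator are assumptions, not shown in the original.
val raw = sc.textFile("data/winequality-red.csv")
val header = raw.first()
val rdd = raw.filter(line => line != header).map(line => line.split(";"))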
Scaling
Make each feature have approximately zero mean by replacing each value x with x - μ, where μ is the column mean, and unit standard deviation by dividing the centered value by the column's standard deviation σ, i.e. z = (x - μ) / σ.
Import
import org.apache.spark.mllib.feature.StandardScaler
Code
val scaler = new StandardScaler(withMean = true, withStd = true).fit(trainingSet.map(dp => dp.features))
Where
StandardScaler is a constructor with two parameters:
- withMean (default: false): Centers the data with the mean before scaling. It builds a dense output, so it does not work on sparse input and will raise an exception.
- withStd (default: true): Scales the data to unit standard deviation.
fit(data: RDD[Vector]): StandardScalerModel computes the column means and variances and stores them in a model to be used for later scaling.
trainingSet consists of LabeledPoint instances, so dp is a LabeledPoint; its features method returns the feature vector of that data point.
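The construction of trainingSet and testSet is not shown above; a minimal sketch, assuming the last column of each row is the label and an 80/20 random split (the split ratio and seed are assumptions):
import org.apache.spark.mllib.regression.LabeledPoint
// Hypothetical: label = last column, features = the remaining columns.
val labeledData = rdd.map { row =>
  val values = row.map(_.toDouble)
  LabeledPoint(values.last, Vectors.dense(values.init))
}
val Array(trainingSet, testSet) = labeledData.randomSplit(Array(0.8, 0.2), seed = 11L)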
- Scale the training and test set.
val scaledTrainingSet = trainingSet.map(dp => new LabeledPoint(dp.label, scaler.transform(dp.features))).cache()
val scaledTestSet = testSet.map(dp => new LabeledPoint(dp.label, scaler.transform(dp.features))).cache()
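As a quick sanity check (a sketch, not part of the original), the scaled training features should now have column means close to 0 and variances close to 1:
// Recompute column statistics on the scaled training features.
val scaledStats = Statistics.colStats(scaledTrainingSet.map(dp => dp.features))
println(s"Scaled Mean : ${scaledStats.mean} and Scaled Variance : ${scaledStats.variance}")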