Creating LabeledPoint
First, take a glance at the sample data:
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.4;0.66;0;1.8;0.075;13;40;0.9978;3.51;0.56;9.4;5
7.9;0.6;0.06;1.6;0.069;15;59;0.9964;3.3;0.46;9.4;5
Each training sample has the following format: the last field is the quality of the wine, a rating from 1 to 10, and the first 11 fields are the properties of the wine. So, from our perspective, the quality of the wine is the y variable (the output) and the remaining fields are the x variables (the inputs).
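As a quick illustration (plain Scala, independent of Spark, using the first sample line shown above), the row splits into its x and y parts like this:
val line = "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5"
val fields = line.split(";").map(s => s.toDouble)
val x = fields.init   // the 11 wine properties (inputs)
val y = fields.last   // the quality rating: 5.0 (output)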
Loading Data into Spark RDD
Import
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
Code
// Configure and create the Spark context (local mode with 2 threads)
val conf = new SparkConf().setAppName("linearRegressionWine").setMaster("local[2]")
val sc = new SparkContext(conf)
// Read the CSV file and split each line on ';', producing an RDD[Array[String]]
val rdd = sc.textFile("winequality-red.csv").map(line => line.split(";"))
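At this point each element of the RDD is an Array[String] with 12 fields. As an optional sanity check (not part of the original listing, and assuming the file contains only data rows as shown above), you can inspect the first parsed row:
val firstRow = rdd.first()
println(firstRow.length)           // 12
println(firstRow.mkString(", "))   // 7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.9978, 3.51, 0.56, 9.4, 5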
Converting each instance into LabeledPoint
A LabeledPoint is a simple wrapper that pairs the input features (our x variables) with the known label (the y variable) for those inputs:
Import
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
Code
val dataPoints = rdd.map(row =>
  new LabeledPoint(
    // The last field is the label (wine quality)
    row.last.toDouble,
    // The first 11 fields are the features
    Vectors.dense(row.take(row.length - 1).map(str => str.toDouble))
  )
).cache()
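To verify the conversion (an optional check, assuming the file parsed cleanly), look at the first LabeledPoint and count the cached records; the output should resemble the following:
println(dataPoints.first())   // (5.0,[7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4])
println(dataPoints.count())   // the number of training samples in the file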
Where:
- LabeledPoint is a case class; its constructor is new LabeledPoint(label: Double, features: Vector), where label is the label for this data point and features is the vector of features for this data point.
- Vector is a set of numbers, and the number of coordinates determines the dimension of the space. There are two types of Vector: dense and sparse. A dense vector is backed by an array of its values, while a sparse vector is backed by two parallel arrays, one for indices and one for values. For example (see also the sketch after this list):
- Dense vector: val dvPerson = Vectors.dense(160.0, 69.0, 24.0)
- Sparse vector: the vector (1.0, 0.0, 0.0, 3.0, 0.0) is written as val sparVector = Vectors.sparse(5, Array(0, 3), Array(1.0, 3.0)). The values in the last array must be of type Double, so write them with at least one decimal.
- Caching: cache() keeps the RDD in memory so it can be reused without being recomputed.
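The two representations hold the same values; the following standalone sketch (using only the Vectors factory shown above) writes the vector (1.0, 0.0, 0.0, 3.0, 0.0) both ways:
import org.apache.spark.mllib.linalg.Vectors

// Dense: stores every value, including the zeros
val dense = Vectors.dense(1.0, 0.0, 0.0, 3.0, 0.0)
// Sparse: stores the size (5), the indices of the non-zero entries (0 and 3),
// and the corresponding values (1.0 and 3.0)
val sparse = Vectors.sparse(5, Array(0, 3), Array(1.0, 3.0))

println(dense)                           // [1.0,0.0,0.0,3.0,0.0]
println(sparse)                          // (5,[0,3],[1.0,3.0])
println(sparse.toArray.mkString(","))    // 1.0,0.0,0.0,3.0,0.0 (same values as the dense form)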