Creating LabeledPoint
First, take a glance at the sample data.
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.4;0.66;0;1.8;0.075;13;40;0.9978;3.51;0.56;9.4;5
7.9;0.6;0.06;1.6;0.069;15;59;0.9964;3.3;0.46;9.4;5
Each training sample has the following format: the first 11 fields are the properties of the wine, and the last field is its quality, a rating from 1 to 10. So, from our perspective, the quality of the wine is the y variable (the output) and the remaining fields are the x variables (the inputs).
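Before involving Spark, the split described above can be checked in plain Scala. This is a minimal sketch (the `ParseSample` object and `parse` helper are illustrative, not part of the tutorial's pipeline):

```scala
// Split one semicolon-separated sample line into the 11 feature
// values (x) and the trailing quality label (y).
object ParseSample {
  def parse(line: String): (Array[Double], Double) = {
    val fields = line.split(";").map(_.toDouble)
    (fields.take(fields.length - 1), fields.last)
  }

  def main(args: Array[String]): Unit = {
    val (x, y) = parse("7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5")
    println(x.length) // 11 features
    println(y)        // 5.0
  }
}
```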
Loading Data into Spark RDD
Import
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
Code
val conf = new SparkConf().setAppName("linearRegressionWine").setMaster("local[2]")
val sc = new SparkContext(conf)
val rdd = sc.textFile("winequality-red.csv").map(line => line.split(";"))
Converting each instance into LabeledPoint
A LabeledPoint is a simple wrapper that pairs the input features (our x variables) with the known output value (the y variable) for those inputs:
Import
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
Code
val dataPoints = rdd.map(row =>
  new LabeledPoint(
    row.last.toDouble,
    Vectors.dense(row.take(row.length - 1).map(str => str.toDouble))
  )
).cache()
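The row logic inside that map can be verified locally without a Spark cluster. The sketch below applies the identical transformation to a plain Seq, with a hypothetical `Point` case class standing in for LabeledPoint:

```scala
// Apply the RDD transformation logic locally: for each split row,
// the last field becomes the label, the rest become features.
object LocalLabeled {
  // Stand-in for org.apache.spark.mllib.regression.LabeledPoint
  case class Point(label: Double, features: Array[Double])

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5",
      "11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6"
    )
    val points = lines
      .map(_.split(";"))
      .map(row => Point(row.last.toDouble,
                        row.take(row.length - 1).map(_.toDouble)))
    points.foreach(p => println(s"label=${p.label}, dim=${p.features.length}"))
  }
}
```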
Where
LabeledPoint
is a case class, and its constructor is new LabeledPoint(label: Double, features: Vector), where label is the label for this data point and features is the feature vector.
Vector
is a set of numbers; the number of coordinates determines the dimension of the space. There are two types of Vector: dense and sparse. A dense vector is backed by an array of its values, while a sparse vector is backed by two parallel arrays, one for indices and one for values. Example:
- Dense vector
val dvPerson = Vectors.dense(160.0, 69.0, 24.0)
- Sparse vector: (1.0, 0.0, 0.0, 3.0, 0.0)
val sparVector = Vectors.sparse(5, Array(0, 3), Array(1.0, 3.0))
The values must be specified as double, so use at least one decimal in each literal.
- Caching: cache() keeps the RDD in memory so it is not recomputed on each use.