Creating LabeledPoint
First, take a glance at the sample data.
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.4;0.66;0;1.8;0.075;13;40;0.9978;3.51;0.56;9.4;5
7.9;0.6;0.06;1.6;0.069;15;59;0.9964;3.3;0.46;9.4;5
Each training sample has the following format: the first 11 fields are the properties of the wine, and the last field is its quality, a rating from 1 to 10. So, from our perspective, the quality of the wine is the y variable (the output) and the remaining fields are the x variables (the inputs).
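Before involving Spark, the split described above can be checked in plain Scala. This is a minimal sketch (the `ParseSample` object and `parse` helper are illustrative, not part of the tutorial's pipeline):

```scala
// Split one semicolon-separated sample line into the 11 feature
// values (x) and the trailing quality label (y).
object ParseSample {
  def parse(line: String): (Array[Double], Double) = {
    val fields = line.split(";").map(_.toDouble)
    (fields.take(fields.length - 1), fields.last)
  }

  def main(args: Array[String]): Unit = {
    val (x, y) = parse("7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5")
    println(x.length) // 11 features
    println(y)        // 5.0
  }
}
```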
Loading Data into Spark RDD
Import
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
Code
val conf = new SparkConf().setAppName("linearRegressionWine").setMaster("local[2]")
val sc = new SparkContext(conf)
val rdd = sc.textFile("winequality-red.csv").map(line => line.split(";"))
Converting each instance into LabeledPoint
A LabeledPoint is a simple wrapper that pairs the input features (our x variables) with the known output value (the y variable) for those inputs:
Import
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
Code
val dataPoints = rdd.map(row =>
  new LabeledPoint(
    row.last.toDouble,
    Vectors.dense(row.take(row.length - 1).map(str => str.toDouble))
  )
).cache()
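The row logic inside that map can be verified locally without a Spark cluster. The sketch below applies the identical transformation to a plain Seq, with a hypothetical `Point` case class standing in for LabeledPoint:

```scala
// Apply the RDD transformation logic locally: for each split row,
// the last field becomes the label, the rest become features.
object LocalLabeled {
  // Stand-in for org.apache.spark.mllib.regression.LabeledPoint
  case class Point(label: Double, features: Array[Double])

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5",
      "11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6"
    )
    val points = lines
      .map(_.split(";"))
      .map(row => Point(row.last.toDouble,
                        row.take(row.length - 1).map(_.toDouble)))
    points.foreach(p => println(s"label=${p.label}, dim=${p.features.length}"))
  }
}
```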
Where
LabeledPoint
is a case class, and its constructor is new LabeledPoint(label: Double, features: Vector), where label is the label for this data point and features is the feature vector.
Vector
is a set of numbers; the number of coordinates determines the dimension of the space. There are two types of Vector: dense and sparse. A dense vector is backed by an array of its values, while a sparse vector is backed by two parallel arrays, one for indices and one for values. Example:
- Dense vector
val dvPerson = Vectors.dense(160.0, 69.0, 24.0)
- Sparse vector: (1.0, 0.0, 0.0, 3.0, 0.0)
val sparVector = Vectors.sparse(5, Array(0, 3), Array(1.0, 3.0))
The values must be specified as double, so use at least one decimal in each literal.
- Caching: cache() keeps the RDD in memory so it is not recomputed on each use.