Creating LabeledPoint

First, take a look at the sample data.

7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.4;0.66;0;1.8;0.075;13;40;0.9978;3.51;0.56;9.4;5
7.9;0.6;0.06;1.6;0.069;15;59;0.9964;3.3;0.46;9.4;5

Each training sample has the following format: the first 11 fields are the properties of the wine, and the last field is its quality, a rating from 0 to 10. From our perspective, the quality is the y variable (the output) and the other fields are the x variables (the inputs).
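
The layout described above can be sketched in plain Scala, without Spark, before moving on to RDDs. The object name WineLine and its parse method are hypothetical helpers introduced only for this illustration, and the sketch assumes every field is numeric, as in the sample rows.

```scala
// Plain-Scala sketch (no Spark required): split one sample line
// into its label (last field) and its features (all other fields).
object WineLine {
  // Returns (label, features) for a semicolon-separated line of numbers.
  def parse(line: String): (Double, Array[Double]) = {
    val fields = line.split(";").map(_.trim.toDouble)
    (fields.last, fields.init)
  }
}
```

For example, WineLine.parse("7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5") gives a label of 5.0 and an array of 11 features.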

Loading Data into Spark RDD

Import

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

Code

val conf = new SparkConf().setAppName("linearRegressionWine").setMaster("local[2]")
val sc = new SparkContext(conf)
val rdd = sc.textFile("winequality-red.csv").map(line => line.split(";"))

Converting each instance into LabeledPoint

A LabeledPoint is a simple wrapper around the input features (our x variables) and the known value (our y variable, the label) observed for those inputs:

Import

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

Code

val dataPoints = rdd.map(row => 
    new LabeledPoint(
          row.last.toDouble, 
          Vectors.dense(row.take(row.length - 1).map(str => str.toDouble))
    )
  ).cache()

Where

  1. LabeledPoint is a case class, and its constructor is

     new LabeledPoint(label: Double, features: Vector)
    
    label: Label for this data point.
    features: Vector of features for this data point.
  2. A Vector is an ordered list of numbers; the number of coordinates determines the dimension of the space.

    There are two types of Vector: dense and sparse. A dense vector is backed by an array of its values, while a sparse vector is backed by two parallel arrays, one for indices and another for values.

    Example:

    1. Dense vector
      val dvPerson = Vectors.dense(160.0,69.0,24.0)
      
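    2. Sparse vector: (1.0, 0.0, 0.0, 3.0, 0.0)
      val sparVector = Vectors.sparse(5, Array(0, 3), Array(1.0, 3.0))
      
      The arguments are the vector size, the indices of the non-zero entries, and their values. The last array must have Double as its element type: either declare it explicitly or write at least one value with a decimal point.
  3. Caching: cache() keeps the RDD in memory, so later operations reuse it instead of re-reading and re-parsing the file.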
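
To make the sparse layout more concrete, here is a minimal plain-Scala sketch that expands a size plus two parallel arrays (indices and values) into the equivalent dense array. It only illustrates the idea and is not MLlib's implementation; the object name SparseSketch and the method toDense are hypothetical.

```scala
// Illustration only: mimics the sparse layout (size, indices, values).
// This is a sketch, not how MLlib implements its sparse vectors.
object SparseSketch {
  // size: total number of coordinates.
  // indices(i) is the position whose value is values(i); all other positions are 0.0.
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    require(indices.length == values.length, "indices and values must have the same length")
    val dense = Array.fill(size)(0.0)
    for ((idx, v) <- indices.zip(values)) dense(idx) = v
    dense
  }
}
```

Calling SparseSketch.toDense(5, Array(0, 3), Array(1.0, 3.0)) produces the coordinates 1.0, 0.0, 0.0, 3.0, 0.0, which match the sparse vector example above.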