Using Maven to Build a Spark Job in Scala

Prerequisites

Scala IDE.

Pre-installed Spark, in either cluster or standalone mode.

Create the Project and Set Up Maven.

  1. Create a new Maven Project.
  2. At the Select an Archetype step, choose maven-archetype-quickstart.
  3. Fill in the Group Id and Artifact Id as follows, and press Finish.

    New Project Arch

  4. Scala IDE will help you set up the Maven project. Initially the directory tree looks like this...

  5. In Scala IDE, add a folder named scala under /src/main and /src/test respectively. /src/main/scala is the main location where we place the Scala source code and packages.
  6. Add the two directories /src/main/scala and /src/test/scala as source folders: right-click on each folder, open Build Path, then select Use as Source Folder. After you finish this step, your project will look like this...

  7. Edit the pom.xml file.

    1. Open the file pom.xml and select the pom.xml tab.

      pom view

    2. The pom.xml file is used to manage the project's libraries and plugins.

      <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      
       <groupId>idv.jiaming</groupId>
       <artifactId>spark-scala-demo</artifactId>
       <version>0.0.1-SNAPSHOT</version>
       <packaging>jar</packaging>
      
       <name>spark-scala-demo</name>
       <url>http://maven.apache.org</url>
      
       <properties>
           <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
       </properties>
      
       <dependencies>
           <dependency>
               <groupId>junit</groupId>
               <artifactId>junit</artifactId>
               <version>3.8.1</version>
               <scope>test</scope>
           </dependency>
       </dependencies>
      </project>
      

      The <dependencies> tag lists all the libraries included in the project; Maven downloads these libraries and their dependencies automatically. To make Maven compile and build the Scala project, we have to add two plugins.

    3. To add Spark as a dependency, insert the following dependency metadata into the <dependencies> section.
      <dependency>
       <groupId>org.apache.spark</groupId>
       <artifactId>spark-core_2.11</artifactId>
       <version>1.5.2</version>
      </dependency>
      
      Note that, for convenience, I specified the Spark version as 1.5.2, but the version should match your pre-installed Spark version (for example, the installed version is printed in the spark-shell startup banner).
    4. Before adding plugins, insert a new section <build></build> in the <project> section.
      <project xmlns=...>
      ...
       <dependencies>
         <dependency>
         ...
         </dependency>
       </dependencies>
       <build>
       </build>
      </project>
      
    5. Add the two plugins inside the <build> section:
      <build>
        <plugins>
           <plugin>
           ...
           </plugin>
           <plugin>
           ...
           </plugin>
        </plugins>
      </build>
      
    6. Set up the first plugin, scala-maven-plugin, which compiles the Scala sources, as follows:
       <plugin>
           <groupId>net.alchim31.maven</groupId>
           <artifactId>scala-maven-plugin</artifactId>
           <version>3.1.3</version>
           <executions>
               <execution>
                   <goals>
                       <goal>compile</goal>
                       <goal>testCompile</goal>
                   </goals>
                   <configuration>
                       <args>
                           <arg>-make:transitive</arg>
                           <arg>-dependencyfile</arg>
                           <arg>${project.build.directory}/.scala_dependencies</arg>
                       </args>
                   </configuration>
               </execution>
           </executions>
       </plugin>
      
    7. The second plugin is maven-assembly-plugin, which packages the compiled classes together with all dependencies into a single runnable jar (the jar-with-dependencies descriptor):
       <plugin>
           <artifactId>maven-assembly-plugin</artifactId>
           <version>2.4.1</version>
           <configuration>
               <descriptorRefs>
                   <descriptorRef>jar-with-dependencies</descriptorRef>
               </descriptorRefs>
           </configuration>
           <executions>
               <execution>
                   <id>make-assembly</id>
                   <phase>package</phase>
                   <goals>
                       <goal>single</goal>
                   </goals>
               </execution>
           </executions>
       </plugin>
      
    8. Specify the source code path by adding the <sourceDirectory> tag:

      <build>
       <sourceDirectory>src/main/scala</sourceDirectory>
       <plugins>
       ...
       </plugins>
      </build>
      

      The value of <sourceDirectory> should be the same as the source folder src/main/scala we set up above. Note that if you set up a different source folder path, specify your own path here instead.

    9. (Optional) In a real project, it is common to include unit test code. In this tutorial we created the directory src/test/scala and set it as a source folder, so we also have to specify that path in pom.xml (a minimal example test is sketched after this list).

      <build>
       <sourceDirectory>src/main/scala</sourceDirectory>
       <testSourceDirectory>src/test/scala</testSourceDirectory>
       <plugins>
       ...
       </plugins>
      </build>
      
    10. After saving the changes, Maven starts to pull all the dependencies and plugins from the Internet and builds the project. This step takes time, depending on your network bandwidth.
  8. After the dependencies are pulled and the project builds successfully, we are ready to write our first Spark program in Scala.
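
For reference, here is a minimal sketch of what a unit test under src/test/scala could look like, using the JUnit dependency already declared in pom.xml (the class and method names are illustrative placeholders, not part of the original project):

    package spark.practice1

    // A plain JUnit 3 style test; junit 3.8.1 is already listed in the pom.xml dependencies.
    import junit.framework.TestCase
    import junit.framework.Assert.assertEquals

    class SimpleTest extends TestCase {

      // JUnit 3 picks up methods whose names start with "test".
      def testAddition(): Unit = {
        assertEquals(4, 2 + 2)
      }
    }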

First Spark Application in Scala.

Now let's create a Spark application in Scala to estimate the value of Pi.

  1. Create a new package in /src/main/scala and specify a package name.

    new package

    The first step is to right-click on the source folder /src/main/scala.

  2. Suppose I use spark.practice1 as my package name. We will then see an empty package spark.practice1 in the Eclipse Project Explorer.

    new scala package

  3. Create a new Scala Object named PiValue in the package spark.practice1.

    new object

    new object

    Analogous to the package-creation step, the first step is to right-click on the package.

  4. The initial content of the Scala Object:

     package spark.practice1
    
     object PiValue {
    
     }
    
  5. To write the application that estimates Pi, copy and paste the following listing. The idea is to sample random points in the unit square: the fraction that lands inside the quarter circle approximates Pi/4, so multiplying that fraction by 4 gives the estimate.

     package spark.practice1

     import org.apache.spark.SparkConf
     import org.apache.spark.SparkContext

     object PiValue {

       def main(args: Array[String]) {
         // Run locally with two worker threads; the app name is shown in the Spark UI.
         val conf = new SparkConf().setAppName("PiValueEstimate").setMaster("local[2]")
         val sc = new SparkContext(conf)
         // The number of random samples comes from the first command-line argument.
         val NUM_SAMPLES = Integer.parseInt(args(0))

         // Count how many random points in the unit square fall inside the quarter circle.
         val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
           val x = Math.random()
           val y = Math.random()
           if (x * x + y * y < 1) 1 else 0
         }.reduce(_ + _)
         println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)
       }
     }
    
  6. Right-click on the project name, select Run As... from the menu, then select Run Configurations... to open the configuration wizard.

  7. In the configuration wizard:

    1. Create a new Maven Build configuration.
    2. Specify the configuration name.
    3. Specify the project path; you can click the Browse Workspace button to list all available projects.
    4. Specify the Maven goals: for now, just fill clean package into the Goals text box.
    5. Click the Apply button to save modifications.
    6. Click the Run button to run Maven Build.
  8. The Maven build messages will show up in the Console. The first build takes longer, since Maven has to fetch the dependencies from the Internet.

  9. After you see the success message
    [INFO] --- maven-assembly-plugin:2.4.1:single (make-assembly) @ spark-scala-demo ---
    [INFO] Building jar: D:\GitHome\scala\spark-scala-demo\target\spark-scala-demo-0.0.1-SNAPSHOT-jar-with-dependencies.jar
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 01:00 min
    [INFO] Finished at: 2016-01-07T22:44:43+08:00
    [INFO] Final Memory: 72M/619M
    [INFO] ------------------------------------------------------------------------
    
    Maven tells you where to find the executable jar file. In this example, the jar file is located at
     D:\GitHome\scala\spark-scala-demo\target\spark-scala-demo-0.0.1-SNAPSHOT-jar-with-dependencies.jar
    
  10. Open a terminal on a machine with the Spark client installed, and use the following command to submit the Spark job:

    # spark-submit --class spark.practice1.PiValue spark-scala-demo-0.0.1-SNAPSHOT-jar-with-dependencies.jar 10000
    

    Here --class spark.practice1.PiValue tells spark-submit which main class to run, and 10000 is the first argument passed to main; it is also the number of samples used in the Pi estimation. When the job finishes, you should see a line like Pi is roughly 3.14... in the output.

    Note that we don't have to specify the execution mode on the command line, since the application already sets it to local[2]:

     val conf = new SparkConf().setAppName("PiValueEstimate").setMaster("local[2]")
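
If you would rather choose the execution mode at submit time, a common variation (just a sketch, not part of the walkthrough above) is to drop the setMaster call and let spark-submit supply the master instead:

     // Sketch: no setMaster() here, so the master comes from spark-submit,
     // e.g. spark-submit --master local[2] ... or --master yarn ...
     val conf = new SparkConf().setAppName("PiValueEstimate")
     val sc = new SparkContext(conf)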