Spark, started at UC Berkeley AMPLab in 2009,
is a fast and general cluster computing system for Big Data.
The Spark source tree ships a shell script, build/mvn,
which automatically downloads compatible versions of Scala and Maven
and compiles the source code.
git clone https://github.com/apache/spark
cd spark
build/mvn -DskipTests clean package
The Spark configuration files are located in conf/. To launch the master and the slave (worker) daemons:
sbin/start-all.sh
To run examples provided in Scala:
bin/run-example SparkPi 10
The run-example script runs the examples bundled with Spark by calling spark-submit under the hood.
To submit jobs provided in Python:
bin/spark-submit examples/src/main/python/pi.py 10
To submit jobs provided in Java/Scala:
bin/spark-submit --class org.apache.spark.examples.SparkPi examples/target/spark-examples_*.jar 10
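The same spark-submit mechanism works for your own applications. Below is a minimal sketch of a standalone Scala job (the WordCount name and the command-line input path are illustrative, not one of Spark's bundled examples); packaged into a jar, it could be submitted with --class WordCount just like above.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Word Count")
    val sc = new SparkContext(conf)
    val counts = sc.textFile(args(0))      // input path passed on the command line
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.take(10).foreach(println)       // print a small sample of the counts
    sc.stop()
  }
}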
For more information about spark-submit, see the official Spark documentation.
Spark has three main components: the Master, the Worker, and the Client (invoked via spark-submit).
Each component has its own main function.
Communication between the Master and the Workers is handled by Akka.
Each example also has a main function; this is where the example code starts.
So there are four main functions in total.
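To get a feel for how such actor-based communication looks, here is a minimal sketch using classic Akka actors. The MasterActor, RegisterWorker, and Heartbeat names below are simplified stand-ins, not Spark's actual classes, and it assumes Akka 2.4+ for system.terminate.
import akka.actor.{Actor, ActorSystem, Props}

// Hypothetical messages, simplified from Spark's real deploy protocol.
case class RegisterWorker(id: String)
case object Heartbeat

class MasterActor extends Actor {
  def receive = {
    case RegisterWorker(id) => println(s"Master: worker $id registered")
    case Heartbeat          => println("Master: heartbeat received")
  }
}

object AkkaDemo {
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("demo")
    val master = system.actorOf(Props[MasterActor], "master")
    master ! RegisterWorker("worker-1")   // a Worker would register itself with the Master
    master ! Heartbeat                    // and then send periodic heartbeats
    Thread.sleep(500)                     // let the actor process its mailbox
    system.terminate()
  }
}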
Take SparkPi as an example:
package org.apache.spark.examples
import scala.math.random
import org.apache.spark._
/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.parallelize(1 until n, slices) // builds a ParallelCollectionRDD from the range
      .map { i =>                                    // lazily creates a MapPartitionsRDD; nothing runs yet
        val x = random * 2 - 1
        val y = random * 2 - 1
        if (x*x + y*y < 1) 1 else 0
      }
      .reduce(_ + _)                                 // reduce is an action: the job is scheduled and run here
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
The above code first creates a ParallelCollectionRDD, which holds the slices, by calling parallelize.
The map call then creates a new MapPartitionsRDD.
The MapPartitionsRDD stores a reference to the parent ParallelCollectionRDD together with the map function, but nothing is actually computed at this point.
Only when reduce, an action, is called does Spark schedule and run the map and reduce jobs.
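This laziness is easy to observe directly. Below is a minimal sketch (assuming a local master; toDebugString prints the RDD lineage without triggering any computation):
import org.apache.spark.{SparkConf, SparkContext}

object LazyDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Lazy Demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val data   = sc.parallelize(1 to 10, 2)  // ParallelCollectionRDD: nothing computed yet
    val mapped = data.map(_ * 2)             // MapPartitionsRDD: still nothing computed
    println(mapped.toDebugString)            // prints the lineage without running a job
    val sum = mapped.reduce(_ + _)           // reduce is an action: the job runs here
    println(s"sum = $sum")
    sc.stop()
  }
}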
More examples about RDDs can be found in the official Spark documentation.