Spark, started at UC Berkeley AMPLab in 2009,
is a fast and general cluster computing system for Big Data.
The Spark source tree contains a shell script, build/mvn, which automatically downloads compatible versions of Scala and Maven and then compiles the source code.
git clone https://github.com/apache/spark
cd spark
build/mvn -DskipTests clean package
The Spark config files are located in conf.
To launch the master and slaves:
sbin/start-all.sh
To run the examples provided in Scala:
bin/run-example SparkPi 10
The run-example script runs an example shipped with Spark by calling spark-submit.
To submit jobs written in Python:
bin/spark-submit examples/src/main/python/pi.py 10
To submit jobs written in Java/Scala:
bin/spark-submit --class org.apache.spark.examples.SparkPi examples/target/spark-examples_*.jar 10
For more information about spark-submit, see this.
Spark has 3 main components, and each component has its own main function.
The communication between Master and Worker is provided by Akka.
Each example also has a main function, which is where the example code starts.
So there are 4 main functions in total.
Take SparkPi as an example:
package org.apache.spark.examples

import scala.math.random
import org.apache.spark._

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    // 1 until n yields n - 1 samples, so divide by n - 1
    println("Pi is roughly " + 4.0 * count / (n - 1))
    spark.stop()
  }
}
The code above first calls parallelize to create a ParallelCollectionRDD, which holds the input range split into slices.
Then the call to map creates a new MapPartitionsRDD.
The MapPartitionsRDD stores the previous ParallelCollectionRDD and the map function, but nothing is “computed” yet.
Only when reduce is called are the map and reduce jobs actually scheduled.
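The lazy behaviour described above can be illustrated with a toy model in plain Scala that needs no Spark installation. This is only a sketch: ToyRDD, ParallelToyRDD and MappedToyRDD are hypothetical names invented for this example, not Spark classes, and the even-split arithmetic in compute is an assumption about how a collection might be cut into slices. The idea it shows is that map only records the parent and the function, while reduce is what walks the chain and runs the work.

```scala
object LazyLineageSketch {
  // Toy stand-in for an RDD (hypothetical, not a Spark class).
  sealed trait ToyRDD[A] {
    def numSlices: Int

    // Produces the elements of one slice; only invoked when an action runs.
    def compute(slice: Int): Iterator[A]

    // Transformation: remembers the parent and the function, computes nothing.
    def map[B](f: A => B): ToyRDD[B] = new MappedToyRDD(this, f)

    // Action: evaluates every slice through the whole chain and combines the results.
    def reduce(op: (A, A) => A): A =
      (0 until numSlices).iterator.flatMap(i => compute(i)).reduce(op)
  }

  // Plays the role of ParallelCollectionRDD: local data cut into slices.
  final class ParallelToyRDD[A](data: IndexedSeq[A], val numSlices: Int) extends ToyRDD[A] {
    def compute(slice: Int): Iterator[A] = {
      // Assumed even split: slice i covers [i * len / n, (i + 1) * len / n).
      val start = (slice.toLong * data.length / numSlices).toInt
      val end = ((slice + 1).toLong * data.length / numSlices).toInt
      data.slice(start, end).iterator
    }
  }

  // Plays the role of MapPartitionsRDD: stores the parent and the map function.
  final class MappedToyRDD[A, B](parent: ToyRDD[A], f: A => B) extends ToyRDD[B] {
    def numSlices: Int = parent.numSlices
    def compute(slice: Int): Iterator[B] = parent.compute(slice).map(f)
  }

  def main(args: Array[String]): Unit = {
    var calls = 0 // counts how often the map function actually runs
    val base = new ParallelToyRDD(1 to 10, numSlices = 2)
    val mapped = base.map { i => calls += 1; i * 2 }
    println(s"after map: calls = $calls") // prints "after map: calls = 0"
    val total = mapped.reduce(_ + _)
    println(s"after reduce: calls = $calls, total = $total") // calls = 10, total = 110
  }
}
```

In real Spark the action additionally hands the work to a scheduler that distributes tasks across workers; the toy model runs everything in one thread and only mirrors the "store now, compute on action" structure.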
More examples of RDDs can be found here.