Writing a Word Count example in Hadoop is fairly complicated. Spark's version of this "Hello World" example is much simpler:
val textFile = sc.textFile("hdfs://...")   // sc is the SparkContext provided by spark-shell
val counts = textFile.flatMap(line => line.split(" "))   // split lines into words
                     .map(word => (word, 1))             // pair each word with a count of 1
                     .reduceByKey(_ + _)                  // sum the counts per word
counts.saveAsTextFile("hdfs://...")
This magic is powered by functional programming. Scala, the language Spark is written in, has map and reduce built in; it also defines flatMap and mapValues. If you only want to do a single-node word count, you don't have to run Spark or Hadoop at all. Scala's functional programming is enough.
1) Create several lines of words.
scala> val lines = List("Hello world", "This is a hello world example for word count")
lines: List[String] = List(Hello world, This is a hello world example for word count)
2) Use flatMap to split the lines into a single list of words.
scala> val words = lines.flatMap(_.split(" "))
words: List[String] = List(Hello, world, This, is, a, hello, world, example, for, word, count)
3) Group the words into a Map.
scala> words.groupBy((word:String) => word)
res9: scala.collection.immutable.Map[String,List[String]] = Map(for -> List(for), count -> List(count), is -> List(is), This -> List(This), world -> List(world, world), a -> List(a), Hello -> List(Hello), hello -> List(hello), example -> List(example), word -> List(word))
The result has type Map[String,List[String]]: each distinct word maps to the list of its occurrences.
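As an aside, since the grouping key is the word itself, the same grouping can be written more concisely with Scala's built-in identity function:

scala> words.groupBy(identity)

This produces the same Map[String,List[String]] as above.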
4) Count the length of each word's list.
scala> words.groupBy((word:String) => word).mapValues(_.length)
res12: scala.collection.immutable.Map[String,Int] = Map(for -> 1, count -> 1, is -> 1, This -> 1, world -> 2, a -> 1, Hello -> 1, hello -> 1, example -> 1, word -> 1)
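Putting the steps together, the whole single-node word count becomes one expression; this is just the pipeline above, collapsed:

scala> lines.flatMap(_.split(" ")).groupBy(identity).mapValues(_.length)

It returns the same Map[String,Int] as step 4.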
Scala does not have a built-in reduceByKey function like Spark, which is why I use groupBy here. Here is a discussion about why reduceByKey works much better than groupByKey on large datasets.
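For what it's worth, the groupBy approach shares groupByKey's drawback in miniature: it builds the full List[String] for every word before counting. A minimal single-pass sketch in plain Scala, using foldLeft to keep a running count per word (roughly what reduceByKey does for each key):

scala> words.foldLeft(Map.empty[String, Int]) { (counts, word) =>
     |   counts + (word -> (counts.getOrElse(word, 0) + 1))   // bump this word's count
     | }

This visits each word once and never materializes the intermediate lists.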