The Power of Choice in Data-Aware Cluster Scheduling (OSDI 2014).

To process a large amount of data in a short time, data sampling selects, manipulates, and analyzes a representative subset of data points to identify patterns and trends in the larger data set. In data-intensive computing frameworks such as Hadoop and Spark, large data sets are divided into partitions and dispersed across many nodes, and computing tasks are scheduled on the nodes that hold the partitions they read. A sampling job in these frameworks selects a subset of partitions and submits the corresponding tasks, which the framework then schedules according to partition locality. Unfortunately, the sampling step is unaware of data locality, so the selected partitions may be concentrated on a few nodes, which can cause imbalanced workloads among nodes. As the amount of data grows, more partitions are processed and the workload skew becomes more severe.
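The skew described above can be illustrated with a small Python sketch. It is only a toy model: the round-robin placement, the node names, and the helper `sample_load_per_node` are assumptions made for illustration, not part of Hadoop, Spark, or the paper.

```python
import random
from collections import Counter


def sample_load_per_node(partition_to_node, k, seed=0):
    """Pick k partitions uniformly at random, ignoring locality,
    and count how many of the resulting tasks land on each node."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    chosen = rng.sample(sorted(partition_to_node), k)
    return Counter(partition_to_node[p] for p in chosen)


# Hypothetical layout: 40 partitions spread round-robin over 4 nodes.
placement = {p: f"node{p % 4}" for p in range(40)}

# A locality-unaware sample of 12 partitions.
load = sample_load_per_node(placement, k=12)
for node in sorted(set(placement.values())):
    print(node, load.get(node, 0))
```

A perfectly balanced sample would put 3 tasks on each node. A uniform random sample gives no such guarantee, so some nodes can receive more tasks than others even though the data itself is evenly placed.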

power-of-choice

The optimizations include:

Locality-aware selection of input tasks.
Launching additional upstream tasks when cross-rack network skew is encountered.
Selecting the best upstream outputs using multiple heuristic criteria, such as selecting at least one task from each rack.
Dropping upstream stragglers by choosing an optimal wait time for downstream tasks.
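
The first item, locality-aware input selection, can be sketched in Python as follows. The sketch assumes that the sampling job can use any k of the available partitions, so the scheduler is free to prefer partitions stored on nodes with free slots. The function `pick_local_tasks`, its arguments, and the greedy rule are hypothetical simplifications for illustration, not the scheduler described in the paper.

```python
def pick_local_tasks(partition_to_node, free_slots, k):
    """Greedily choose k partitions, preferring those whose node
    still has a free slot. Returns (local, remote): partitions that
    can run data-local, and partitions that must run elsewhere."""
    remaining = dict(free_slots)  # copy so the caller's dict is untouched
    local, leftover = [], []
    for p in sorted(partition_to_node):
        node = partition_to_node[p]
        if len(local) < k and remaining.get(node, 0) > 0:
            remaining[node] -= 1  # occupy one slot on that node
            local.append(p)
        else:
            leftover.append(p)
    # If local choices run out, fill the rest with non-local tasks.
    remote = leftover[: k - len(local)]
    return local, remote


# Hypothetical cluster: node "a" holds three partitions, "b" and "c" one each,
# and every node has a single free slot.
placement = {0: "a", 1: "a", 2: "a", 3: "b", 4: "c"}
slots = {"a": 1, "b": 1, "c": 1}

local, remote = pick_local_tasks(placement, slots, k=3)
print(local, remote)  # [0, 3, 4] []
```

A locality-unaware sampler that happened to choose partitions 0, 1, and 2 would queue all three tasks on node "a". With the freedom to choose, the same sample size is served by one local task on each node.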