HadoopConceptsNote

Data Locality Optimization

locality

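  • a: Schedule the map task on a node that holds a replica of the input block (node-local).
  • b: Otherwise, schedule it on another node in the same rack as a replica (rack-local).
  • c: Otherwise, schedule it on a node in a different rack (off-rack).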
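
The preference order above can be sketched as a small selection function. This is only an illustration of the a/b/c fallback, not Hadoop's scheduler code; `pick_node`, `rack_of`, and the node and rack labels are hypothetical names made up for the example.

```python
from typing import Dict, List, Optional, Tuple


def pick_node(
    replica_hosts: List[str],
    free_nodes: List[str],
    rack_of: Dict[str, str],
) -> Optional[Tuple[str, str]]:
    """Return (node, locality_level) for a map task, or None if no node is free.

    Preference order: node-local, then rack-local, then off-rack.
    """
    # a: a free node that already holds a replica of the input block
    for node in free_nodes:
        if node in replica_hosts:
            return node, "node-local"

    # b: a free node on the same rack as one of the replicas
    replica_racks = {rack_of[host] for host in replica_hosts}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"

    # c: any remaining free node; the block has to cross racks
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None


# Hypothetical cluster: three racks, input block replicated on n1 and n4.
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2", "n5": "r3"}
replicas = ["n1", "n4"]

print(pick_node(replicas, ["n5", "n1"], rack_of))  # a replica host is free
print(pick_node(replicas, ["n5", "n2"], rack_of))  # only a rack neighbour is free
print(pick_node(replicas, ["n5"], rack_of))        # only an off-rack node is free
```

The function scans the free nodes once per level, so a node-local slot always wins over a rack-local one even if it appears later in the list.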

mrflow

  • Intermediate (map) output:

    • Stored on the local disk of the node that ran the map task.
    • Consumed and processed by the reduce tasks.
    • Failover: If the map task fails, or the node holding its output crashes before the reduce task has consumed that output, the map task is rerun on another node to re-create the output.
  • Reduce tasks do not benefit from data locality:

    • A reduce task consumes map output, normally from every map task, so the output has to be copied over the network to the node running the reduce task. Map output is only intermediate data and is discarded once the job completes, which is why it is not stored on HDFS: replicating it would be wasted effort.
    • The output of the reduce task is stored on HDFS.
    • For each block of reduce output, the first replica is stored on the node that ran the reduce task.
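
The flow above can be mimicked with a tiny in-memory model: map output lands on the mapper node's local store, the reducer copies it, a lost map output causes the map task to be rerun on another node, and only the reduce output reaches the HDFS stand-in. Everything here (`local_disk`, `hdfs`, `run_map`, `fetch_map_output`, `run_reduce`, the node names and the output path) is hypothetical and exists only for this sketch; none of it is Hadoop API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, int]

# Hypothetical stand-ins: one dict per node for local disk, one dict for HDFS.
local_disk: Dict[str, Dict[int, List[Pair]]] = defaultdict(dict)
hdfs: Dict[str, Dict[str, int]] = {}
reruns: List[int] = []  # ids of map tasks that had to be rerun


def run_map(task_id: int, split: str, node: str) -> None:
    """Word-count map: write (word, 1) pairs to the node's local disk."""
    local_disk[node][task_id] = [(word, 1) for word in split.split()]


def fetch_map_output(task_id: int, split: str, node: str, spare_node: str) -> List[Pair]:
    """Copy one map output to the reducer, rerunning the map if it is gone."""
    if task_id not in local_disk[node]:
        # The output was lost (for example, the node crashed): re-create it elsewhere.
        reruns.append(task_id)
        run_map(task_id, split, spare_node)
        node = spare_node
    return list(local_disk[node][task_id])


def run_reduce(pairs: List[Pair], out_path: str) -> None:
    """Sum the counts per word and store the result in the HDFS stand-in."""
    totals: Dict[str, int] = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    hdfs[out_path] = dict(totals)


splits = {0: "a b a", 1: "b c"}
placement = {0: "n1", 1: "n2"}
for task_id, text in splits.items():
    run_map(task_id, text, placement[task_id])

local_disk["n2"].clear()  # simulate n2 losing its map output before the copy

pairs: List[Pair] = []
for task_id, text in splits.items():
    pairs += fetch_map_output(task_id, text, placement[task_id], spare_node="n3")
run_reduce(pairs, "/out/part-r-00000")

# The job is done, so the intermediate map output can be thrown away.
local_disk.clear()

print(hdfs["/out/part-r-00000"])
print(reruns)
```

The model keeps the two kinds of storage separate on purpose: the intermediate data is cheap to re-create by rerunning a map task, so it lives only on local disk, while the final result goes to the replicated store.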