HadoopConceptsNote

Chapter 1: Data Flow

Terminology

Job

  • A unit of work that the client wants performed.
  • Consists of the MapReduce program, the input data, and configuration information.

Task

  • A job is divided into tasks.
  • Two types: map tasks and reduce tasks.
  • Scheduled using YARN.
  • Run on nodes in the cluster.
  • Rescheduled on a different node if a task fails.
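
The relationship between a job and its map and reduce tasks can be sketched in plain Python. This is only a single-process illustration, not Hadoop code: the names map_task, reduce_task and run_job are hypothetical, word counting is an assumed example workload, and there is no YARN scheduling, no cluster, and no failure handling here.

```python
from collections import defaultdict


def map_task(chunk):
    """Map task: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word, 1) for line in chunk for word in line.split()]


def reduce_task(key, values):
    """Reduce task: combine all values collected for one key."""
    return key, sum(values)


def run_job(chunks):
    """Run one map task per chunk, group the pairs by key, then reduce each group.

    In a real cluster the map tasks run in parallel on different nodes;
    here they simply run one after another.
    """
    grouped = defaultdict(list)
    for chunk in chunks:
        for key, value in map_task(chunk):
            grouped[key].append(value)
    return dict(reduce_task(key, values) for key, values in grouped.items())
```

For example, run_job([["a b", "a"], ["b c"]]) runs two map tasks (one per chunk) and returns the word counts for the whole input.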

Split

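  • One map task is created for each split.
  • Overhead: a split must not be too small, or there will be too many splits to process and the cost of managing them and creating map tasks starts to dominate.
  • The recommended split size is the same as the HDFS block size. A larger split may span blocks stored on different nodes, so the map task would have to fetch the rest of its data from other nodes over the network.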
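
The trade-off above can be made concrete with a little arithmetic. The sketch below follows the rule used by Hadoop's FileInputFormat, where the split size is max(minSize, min(maxSize, blockSize)); the Python function names are hypothetical, the 128 MB block size and 1 GB file are assumed example values, and details such as how the final partial split is handled are ignored.

```python
MB = 1024 * 1024


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size rule: the block size, clamped between min_size and max_size."""
    return max(min_size, min(max_size, block_size))


def count_map_tasks(file_size, split_size):
    """One map task per split, so the task count is the file size
    divided by the split size, rounded up (integer ceiling division)."""
    return -(-file_size // split_size)


# Assumed example: a 1 GB file stored in 128 MB HDFS blocks.
file_size = 1024 * MB
block_size = 128 * MB

default_split = compute_split_size(block_size)              # equals the block size
tiny_split = compute_split_size(block_size, max_size=1 * MB)  # capped at 1 MB

tasks_with_default_split = count_map_tasks(file_size, default_split)
tasks_with_tiny_split = count_map_tasks(file_size, tiny_split)
```

With these assumed numbers, a block-sized split gives 8 map tasks, while a 1 MB split gives 1024, which shows how quickly per-task overhead grows when splits are too small.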