Chapter 1: Data Flow
Terminology
Job
- A MapReduce program submitted as a unit of work.
- Consists of the program, its input data, and configuration information.
Task
- A job is divided into tasks.
- Two types: Map tasks and Reduce tasks.
- Scheduled using YARN.
- Run on nodes in the cluster.
- Rescheduled on a different node if a task fails.
Split
- One Map task for each Split
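- Overhead: a Split can't be too small, or there will be too many Splits (and Map tasks) to manage, and the per-task overhead starts to dominate the job's run time.
- The suggested Split size is the same as the HDFS block size (128 MB by default in recent Hadoop versions). A larger Split may span several blocks, and because those blocks are unlikely to all be stored on one node, the Map task would have to fetch the remaining data over the network from other nodes.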
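
The trade-off in the Split notes above can be illustrated with a little arithmetic. The following is a minimal sketch, not how Hadoop itself computes splits: the helper names num_splits and blocks_spanned are hypothetical, the 128 MB block size and the 10 GB file are assumed values, and real input formats apply extra rules that are ignored here.

```python
# Assumed HDFS block size: 128 MB (illustrative value, configurable in practice).
BLOCK_SIZE = 128 * 1024 * 1024


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def num_splits(file_size, split_size):
    """Approximate number of Splits (one Map task each) for a file.

    Hypothetical helper: simply divides the file into equal-sized pieces.
    """
    return ceil_div(file_size, split_size)


def blocks_spanned(split_size, block_size=BLOCK_SIZE):
    """Number of HDFS blocks one block-aligned Split would cover.

    A value above 1 means part of the Split likely lives on another node.
    """
    return ceil_div(split_size, block_size)


if __name__ == "__main__":
    file_size = 10 * 1024**3  # assumed 10 GB input file
    for split_mb in (1, 128, 512):
        split_size = split_mb * 1024 * 1024
        print(
            f"split={split_mb} MB: "
            f"{num_splits(file_size, split_size)} map tasks, "
            f"{blocks_spanned(split_size)} block(s) per split"
        )
```

Under these assumptions, a 1 MB Split produces thousands of Map tasks, a 512 MB Split covers four blocks each, and a Split equal to the block size keeps both numbers low.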