This question evaluates proficiency in designing and optimizing distributed data processing jobs, covering Hadoop MapReduce and Spark concepts such as HDFS replication, task re-execution, shuffling/sorting, mapper and reducer key–value semantics, RDD immutability and lineage-based recovery, partitioning, and combiner usage.
Big data systems: (a) Explain Hadoop’s fault tolerance (HDFS replication, task re-execution) and why MapReduce includes shuffle and sort phases; for a word-count job, specify the mapper and reducer input and output key–value pairs precisely. (b) Explain Spark’s RDD immutability and lineage-based fault recovery, and contrast it with Hadoop’s approach. (c) For computing the top‑k word frequencies per day over a 10 TB dataset, design a two-stage MapReduce (or Spark) pipeline that minimizes shuffles; justify your partitioning scheme and combiner usage.
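For part (a), a minimal sketch of the word-count key–value contracts against the Hadoop `org.apache.hadoop.mapreduce` API, written in Scala; the class names and the tokenization rule are illustrative assumptions, not a required answer:

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}

// Mapper input:  (LongWritable byte offset within the split, Text line of input)
// Mapper output: (Text word, IntWritable 1) -- one pair per word occurrence
class WordCountMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    // Assumed tokenization: lowercase, split on non-word characters.
    value.toString.toLowerCase.split("\\W+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      context.write(word, one)
    }
  }
}

// Reducer input:  (Text word, Iterable[IntWritable] counts) -- grouped and sorted by the shuffle
// Reducer output: (Text word, IntWritable total count)
class WordCountReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    context.write(key, new IntWritable(sum))
  }
}
```

The shuffle and sort phase between these two classes is what groups every `(word, 1)` pair by key, so each `reduce` call sees all counts for exactly one word.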
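For part (b), a small runnable sketch (assuming a local Spark session; object and variable names are illustrative) showing that each transformation returns a new immutable RDD and that the recorded lineage, visible via `toDebugString`, is what Spark replays to rebuild a lost partition, in contrast to Hadoop's reliance on replicated HDFS data and task re-execution:

```scala
import org.apache.spark.sql.SparkSession

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("lineage-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Each transformation returns a new, immutable RDD; nothing is modified in place.
    val lines  = sc.parallelize(Seq("spark keeps lineage", "lineage rebuilds lost partitions"))
    val words  = lines.flatMap(_.split("\\W+")).filter(_.nonEmpty)
    val pairs  = words.map(w => (w.toLowerCase, 1))
    val counts = pairs.reduceByKey(_ + _)

    // The chain of dependencies (the lineage) is what Spark replays to recompute a
    // lost partition, rather than restoring replicated intermediate output from disk.
    println(counts.toDebugString)

    counts.count()   // action: triggers the actual computation
    spark.stop()
  }
}
```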
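For part (c), one possible two-stage Spark sketch under assumed inputs (tab-separated `day<TAB>text` lines at a hypothetical HDFS path, with k = 100): `reduceByKey` supplies map-side combining in the first shuffle, and `aggregateByKey` with a bounded top-k list keeps the second shuffle down to at most k entries per day per partition. Paths, the input format, and the value of k are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

object TopKWordsPerDay {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("topk-words-per-day").getOrCreate()
    val sc = spark.sparkContext
    val k = 100   // assumed value of k

    // Assumed input format: "<yyyy-MM-dd>\t<free text>" per line, at a hypothetical path.
    val lines = sc.textFile("hdfs:///data/events")

    // Stage 1: count per (day, word). reduceByKey performs map-side combining
    // (the combiner role), so only partial sums per (day, word) cross the network.
    val dayWordCounts = lines.flatMap { line =>
      line.split("\t", 2) match {
        case Array(day, text) =>
          text.toLowerCase.split("\\W+").iterator.filter(_.nonEmpty).map(w => ((day, w), 1L))
        case _ => Iterator.empty   // skip malformed lines
      }
    }.reduceByKey(_ + _)

    // Stage 2: re-key by day and keep a bounded top-k list per map-side partition,
    // so the second (much smaller) shuffle moves at most k entries per day per partition.
    val topKPerDay = dayWordCounts
      .map { case ((day, word), count) => (day, (count, word)) }
      .aggregateByKey(List.empty[(Long, String)])(
        (acc, e) => (e :: acc).sortBy(-_._1).take(k),   // fold one (count, word) into the bounded list
        (a, b)   => (a ++ b).sortBy(-_._1).take(k)      // merge two partial top-k lists
      )

    topKPerDay.saveAsTextFile("hdfs:///out/topk-per-day")   // hypothetical output path
    spark.stop()
  }
}
```

Partitioning by the composite (day, word) key in stage 1 spreads hot days across many reducers, while re-keying by day in stage 2 keeps the final per-day top-k selection local to a single task.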