This question evaluates proficiency in designing and optimizing distributed data processing jobs, covering Hadoop MapReduce and Spark concepts such as HDFS replication, task re-execution, shuffling/sorting, mapper and reducer key–value semantics, RDD immutability and lineage-based recovery, partitioning, and combiner usage.
Big data systems: (a) Explain Hadoop’s fault tolerance (HDFS replication, task re-execution) and why MapReduce includes shuffle and sort phases; for a word-count job, specify the mapper and reducer input and output key–value pairs precisely. (b) Explain Spark’s RDD immutability and lineage-based fault recovery, and contrast it with Hadoop’s approach. (c) For computing the top‑k word frequencies per day over a 10 TB dataset, design a two-stage MapReduce (or Spark) pipeline that minimizes shuffles; justify your partitioning scheme and combiner usage.
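For part (a), a minimal sketch of the word-count key–value contracts against the Hadoop `org.apache.hadoop.mapreduce` API, written in Scala; the class names and the tokenization rule are illustrative assumptions, not a required answer:

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}

// Mapper input:  (LongWritable byte offset within the split, Text line of input)
// Mapper output: (Text word, IntWritable 1) -- one pair per word occurrence
class WordCountMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    // Assumed tokenization: lowercase, split on non-word characters.
    value.toString.toLowerCase.split("\\W+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      context.write(word, one)
    }
  }
}

// Reducer input:  (Text word, Iterable[IntWritable] counts) -- grouped and sorted by the shuffle
// Reducer output: (Text word, IntWritable total count)
class WordCountReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    context.write(key, new IntWritable(sum))
  }
}
```

The shuffle and sort phase between these two classes is what groups every `(word, 1)` pair by key, so each `reduce` call sees all counts for exactly one word.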
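For part (b), a small runnable sketch (assuming a local Spark session; object and variable names are illustrative) showing that each transformation returns a new immutable RDD and that the recorded lineage, visible via `toDebugString`, is what Spark replays to rebuild a lost partition, in contrast to Hadoop's reliance on replicated HDFS data and task re-execution:

```scala
import org.apache.spark.sql.SparkSession

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("lineage-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Each transformation returns a new, immutable RDD; nothing is modified in place.
    val lines  = sc.parallelize(Seq("spark keeps lineage", "lineage rebuilds lost partitions"))
    val words  = lines.flatMap(_.split("\\W+")).filter(_.nonEmpty)
    val pairs  = words.map(w => (w.toLowerCase, 1))
    val counts = pairs.reduceByKey(_ + _)

    // The chain of dependencies (the lineage) is what Spark replays to recompute a
    // lost partition, rather than restoring replicated intermediate output from disk.
    println(counts.toDebugString)

    counts.count()   // action: triggers the actual computation
    spark.stop()
  }
}
```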
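For part (c), one possible two-stage Spark sketch under assumed inputs (tab-separated `day<TAB>text` lines at a hypothetical HDFS path, with k = 100): `reduceByKey` supplies map-side combining in the first shuffle, and `aggregateByKey` with a bounded top-k list keeps the second shuffle down to at most k entries per day per partition. Paths, the input format, and the value of k are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

object TopKWordsPerDay {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("topk-words-per-day").getOrCreate()
    val sc = spark.sparkContext
    val k = 100   // assumed value of k

    // Assumed input format: "<yyyy-MM-dd>\t<free text>" per line, at a hypothetical path.
    val lines = sc.textFile("hdfs:///data/events")

    // Stage 1: count per (day, word). reduceByKey performs map-side combining
    // (the combiner role), so only partial sums per (day, word) cross the network.
    val dayWordCounts = lines.flatMap { line =>
      line.split("\t", 2) match {
        case Array(day, text) =>
          text.toLowerCase.split("\\W+").iterator.filter(_.nonEmpty).map(w => ((day, w), 1L))
        case _ => Iterator.empty   // skip malformed lines
      }
    }.reduceByKey(_ + _)

    // Stage 2: re-key by day and keep a bounded top-k list per map-side partition,
    // so the second (much smaller) shuffle moves at most k entries per day per partition.
    val topKPerDay = dayWordCounts
      .map { case ((day, word), count) => (day, (count, word)) }
      .aggregateByKey(List.empty[(Long, String)])(
        (acc, e) => (e :: acc).sortBy(-_._1).take(k),   // fold one (count, word) into the bounded list
        (a, b)   => (a ++ b).sortBy(-_._1).take(k)      // merge two partial top-k lists
      )

    topKPerDay.saveAsTextFile("hdfs:///out/topk-per-day")   // hypothetical output path
    spark.stop()
  }
}
```

Partitioning by the composite (day, word) key in stage 1 spreads hot days across many reducers, while re-keying by day in stage 2 keeps the final per-day top-k selection local to a single task.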