Big data systems:
(a) Explain Hadoop's fault tolerance (HDFS replication, task re-execution) and why MapReduce includes a shuffle-and-sort phase; for a word-count job, specify the mapper and reducer key–value pairs precisely.
(b) Explain Spark's RDD immutability and lineage-based fault recovery; contrast it with Hadoop's approach.
(c) For computing the top-k word frequencies per day over a 10 TB dataset, design a two-stage MapReduce (or Spark) pipeline that minimizes shuffles; justify your partitioning scheme and your use of combiners.
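For part (a), the mapper/reducer key–value contract can be sketched in plain Python. This is a single-process simulation of the MapReduce phases, not actual Hadoop code; the function names (`mapper`, `shuffle`, `reducer`) are illustrative, and the shuffle step stands in for the framework's group-by-key-and-sort behavior:

```python
from collections import defaultdict

def mapper(line):
    # Mapper input: one line of text (key = byte offset, ignored here).
    # Mapper output: (word, 1) for every word occurrence.
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # The framework's shuffle-and-sort: group all mapper output by key
    # so each reducer sees (word, [1, 1, ...]), with keys in sorted order.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    # Reducer input: (word, [counts]); output: (word, total_count).
    yield (key, sum(values))

lines = ["the quick brown fox", "the lazy dog"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(pair for k, v in shuffle(mapped) for pair in reducer(k, v))
# counts["the"] == 2; every other word maps to 1
```

A combiner for part (c) would have the same signature as `reducer` and run on each mapper's local output before the shuffle, cutting the data moved across the network.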