Compare Spark RDDs, DataFrames, and Spark SQL Benefits
Company: Experian
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
spark_jobs
+---------+---------------------+-------+-----------+---------+
| job_id  | submit_time         | user  | memory_gb | status  |
+---------+---------------------+-------+-----------+---------+
| 101     | 2023-10-20 10:35:00 | alice | 8         | success |
| 102     | 2023-10-20 11:10:00 | bob   | 16        | running |
| 103     | 2023-10-20 12:40:00 | carol | 32        | failed  |
+---------+---------------------+-------+-----------+---------+
##### Scenario
Big-data discussion on Spark and cloud pipelines
##### Question
Compare Spark RDDs, DataFrames and Spark SQL and state their advantages.
What benefits does Spark offer versus classic MapReduce?
Explain Spark’s lazy evaluation and how it optimizes execution.
Describe how you would submit and monitor jobs on AWS EMR or a similar managed cluster.
##### Hints
Mention in-memory processing, the Catalyst optimizer and the Tungsten execution engine, DAG scheduling, and cost/scalability on the cloud.
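
To make the lazy-evaluation and DAG-scheduling hints concrete, here is a minimal pure-Python sketch. It is not Spark's API: `LazyDataset` is a hypothetical toy class that only records transformations and runs them when an action (`collect`) is called, which is the general idea behind Spark deferring work until it can plan the whole pipeline. The input values reuse the `memory_gb` column from the sample table.

```python
from typing import Any, Callable, Iterable, List, Tuple


class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, source: Iterable[Any], plan: Tuple = ()):
        self._source = source
        self._plan = plan  # recorded transformations; nothing has run yet

    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        # Transformation: only extends the plan.
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        # Transformation: only extends the plan.
        return LazyDataset(self._source, self._plan + (("filter", pred),))

    def collect(self) -> List[Any]:
        # Action: execute every recorded step in a single pass over the data.
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []  # records when the mapped function actually runs


def double(x: int) -> int:
    calls.append(x)
    return x * 2


ds = LazyDataset([8, 16, 32]).filter(lambda gb: gb >= 16).map(double)
calls_before_action = len(calls)  # transformations alone trigger no work
result = ds.collect()             # the action runs the whole plan
```

Because the filter is applied before the map in the same pass, `double` is never called on the row that gets filtered out. Real Spark goes further (Catalyst can reorder and combine steps), which this toy does not attempt.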
Quick Answer: This question evaluates understanding of Apache Spark's core abstractions (RDDs, DataFrames, Spark SQL), execution-model concepts such as lazy evaluation and optimizer-driven execution, and operational competence with submitting and monitoring jobs on a managed cloud cluster. It is categorized under Data Manipulation (SQL/Python) for a Data Scientist role.
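
For the job-submission part, the following is a hedged sketch of how a `spark-submit` command line for a YARN-backed cluster (such as EMR) might be assembled in Python. The helper `build_spark_submit`, the S3 path, and the job name are hypothetical placeholders; only the `spark-submit` flags shown (`--master`, `--deploy-mode`, `--name`, `--executor-memory`) are standard options. The sketch builds the command and does not execute it.

```python
import shlex
from typing import List


def build_spark_submit(app_path: str, executor_memory_gb: int, name: str) -> List[str]:
    """Assemble a spark-submit argument list (hypothetical helper, illustration only)."""
    return [
        "spark-submit",
        "--master", "yarn",           # managed clusters such as EMR typically run Spark on YARN
        "--deploy-mode", "cluster",   # the driver runs inside the cluster, not on the client
        "--name", name,
        "--executor-memory", f"{executor_memory_gb}g",
        app_path,
    ]


# Placeholder script location and job name; substitute real values.
cmd = build_spark_submit("s3://example-bucket/jobs/etl.py", 8, "job-101")
command_line = shlex.join(cmd)  # shell-safe string form, e.g. for logging
```

Once a job is submitted, its status would typically be followed through the YARN ResourceManager or the Spark web UI and history server; the exact monitoring setup depends on the cluster configuration.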