Apache Spark Execution And DataFrame Fundamentals

What's being tested

Interviewers are checking that you understand how Spark executes DataFrame operations end-to-end: the logical-to-physical planning, lazy evaluation, how transformations become stages/tasks, and the costs that cause slow jobs (notably shuffles and skew). For a Software Engineer, the focus is on reasoning about runtime behavior, choosing the right join/partitioning strategy, and making safe code-level tradeoffs (memory, serialization, and API choices) rather than operating-cluster or product metrics. Expect to explain cause → symptom → fix with concrete knobs and complexity estimates.

Core knowledge

Lazy evaluation: DataFrame transformations are recorded as a logical plan; an action triggers planning, optimization, and execution. This enables whole-stage optimizations and plan-based reordering.
Catalyst optimizer: rewrites logical plans (projection/predicate pushdown, constant folding); it produces a physical plan selected by cost rules — understand what it will and won't optimize automatically.
Tungsten execution: the physical execution focuses on efficient memory use and CPU (e.g., binary UnsafeRow, off-heap memory, vectorized processing) to reduce GC and improve throughput.
Narrow vs wide transformations: narrow (e.g., map, filter) keep data in the same partition; wide (e.g., groupBy, join, reduceByKey) require shuffle across executors and create stage boundaries.
Shuffle cost model: shuffles incur disk/network/serialization I/O; cost ≈ O(data_size) plus overhead per task. Number of output tasks = number of partitions. Tune spark.sql.shuffle.partitions to match cluster cores and input size.
Join strategies: BroadcastHashJoin (works when one side < spark.sql.autoBroadcastJoinThreshold, default 10MB), ShuffleHashJoin for medium, SortMergeJoin for large; broadcast avoids shuffle, SortMerge requires shuffle + sort.
Adaptive Query Execution (AQE): (Spark 3+) can change join types and partitioning at runtime based on actual stats, mitigating misestimates and skew; know when AQE helps vs when to disable it.
Partitioning & skew: good key distribution prevents hot tasks; use repartition, coalesce or salting to break keys when skewed. Target partition count ≈ 2–4× total cores to fill and parallelize work without too many tiny tasks.
Caching and persistence: persist() / cache() materialize RDD/DataFrame in memory (or disk); beneficial only when reused multiple times and when memory fits — otherwise cause eviction and GC pressure.
Serialization & encoders: Dataset uses Encoder and can avoid Java serialization via UnsafeRow; Python (pyspark) crosses the JVM-Python boundary and pays serialization cost per partition/row when using UDFs.
UDFs vs native expressions: plain SQL/DataFrame operations benefit from Catalyst & whole-stage codegen; UDFs (Python or non-optimized JVM) break optimizations and can disable vectorization/whole-stage codegen.
Fault boundaries: stages are the retry unit; long-running tasks due to skew or excessive per-task memory make retries expensive; consider checkpointing only for very long lineage chains to bound DAG size.

Worked example

(Representative interview prompt: "Explain how to optimize a join between a large and a small DataFrame.") Frame the problem: ask data sizes, cardinality of join key, whether small side fits memory, and whether keys are skewed; declare assumptions (e.g., small side ≈ 20MB, join key cardinality moderate). Organize your answer around three pillars: (1) choose join strategy (BroadcastHashJoin if small side < spark.sql.autoBroadcastJoinThreshold), (2) reduce data before join (projection, filter, select), (3) partitioning and skew handling (ensure small-side broadcast, or pre-repartition by join key if not broadcastable). Explicit tradeoff: broadcasting saves shuffle but increases driver/executor memory pressure and network broadcast cost — if small side is borderline (e.g., 50–200MB), consider increasing threshold carefully or use broadcast hint with monitoring. Close by listing actionable checks: inspect explain() physical plan, run with AQE enabled, profile task durations, and if more time would add micro-benchmarks for different thresholds and test with realistic data.

A second angle

(Representative different prompt: "A job running groupBy on a Parquet table is very slow — what do you inspect and change?") Use the same execution concepts but shift focus: check predicate pushdown and column pruning are applied to reduce read I/O, confirm Parquet columnar layout helps vectorized reads, and verify file sizes (many tiny Parquet files can explode task count and overhead). If grouping creates heavy shuffle, reduce shuffle amount by pre-aggregating (mapPartitions with local aggregation), adjust spark.sql.shuffle.partitions, and consider using approx_count_distinct or streaming/incremental aggregation if exact full aggregation is unnecessary. The transfer principle: diagnose where a wide transformation creates shuffle and then minimize data shuffled or change the operator.

Common pitfalls

Pitfall: Blaming Spark without checking the physical plan — many fixes come from reading df.explain(true) and identifying shuffle boundaries, broadcast hints, and missing predicate pushdown.

Pitfall: Over-broadcasting a borderline dataset — forcing broadcast() or increasing spark.sql.autoBroadcastJoinThreshold can OOM executors or trigger large network broadcasts; quantify memory cost before forcing.

Pitfall: Using Python UDFs for simple transformations — they prevent Catalyst optimizations and incur JVM-Python serialization; prefer built-in SQL/DataFrame functions or Scala/Java UDFs with care.

Connections

Interviewers frequently pivot to streaming (how stateful aggregations and checkpointing change execution guarantees) or to lower-level resource allocation (executor sizing, cores-per-task tradeoffs). They may also ask about the contrast between RDD and DataFrame/Dataset APIs when control over serialization or partition-local processing is required.