PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep
|Home/Software Engineering Fundamentals/Point72

Explain Spark Execution and Optimization

Last updated: Jun 21, 2026

Quick Overview

This question evaluates a candidate's understanding of the Apache Spark execution model, lazy evaluation, distributed processing (including jobs, stages, tasks, and shuffles), and performance-tuning considerations for batch and streaming pipelines on columnar data lakes.

  • hard
  • Point72
  • Software Engineering Fundamentals
  • Data Engineer

Explain Spark Execution and Optimization

Company: Point72

Role: Data Engineer

Category: Software Engineering Fundamentals

Difficulty: hard

Interview Round: Technical Screen

You are interviewing for a **Data Engineer** role. In a 30-minute discussion on production data engineering, the interviewer probes how well you understand the way Apache Spark / PySpark actually executes work — and how you reason about performance. Answer the following Spark fundamentals questions in a practical, production context: assume you own batch and streaming pipelines that read from a columnar data lake (Parquet/Delta) and write curated tables consumed downstream. The discussion is structured in three parts that build on one another — the execution model (Parts 1–2) is the lens you use to justify the tuning decisions (Part 3). ### Constraints & Assumptions - **Engine:** Apache Spark 3.x (Adaptive Query Execution available), accessed via the PySpark DataFrame API. - **Scale:** multi-terabyte source tables; clusters with tens to hundreds of executor cores; a mix of large fact tables and small dimension tables. - **Storage:** columnar formats (Parquet/ORC/Delta) on object storage, partitioned by date. - **Goal:** reduce job runtime and cost (wall-clock *and* executor-hours), not just "make it work." ### Clarifying Questions to Ask A strong candidate scopes the problem before tuning. Up front, you would ask: - Is the workload **batch or streaming** (structured streaming with state), and what is the latency / freshness SLA? - What is the **data layout** at the source — file format, partitioning scheme, average file size, and total volume? - Is the pain a **single slow stage**, the whole job end-to-end, or intermittent failures (OOM / disk spill)? - What does the **cluster** look like — executor count, cores and memory per executor, and is dynamic allocation on? - Is the job **read-heavy, join-heavy, or write-heavy** downstream? - Are there **known skewed keys** or hot partitions in the data? ### Part 1 — Lazy evaluation What does it mean that Spark operations are **lazy**? Distinguish transformations from actions, explain *why* laziness exists, and give a concrete example of an optimization that laziness enables. ```hint Vocabulary Separate the two API categories: operations that *build a plan* versus operations that *force execution*. Which verbs actually trigger computation? ``` ```hint Why it pays off Because Spark sees the whole chain before running anything, it can rewrite it. Think about what the Catalyst optimizer can do across a `read → filter → select → write` pipeline — pushing work toward the source. ``` #### What This Part Should Cover - A crisp **transformation-vs-action** distinction, with correct examples on each side. - A correct statement of *when* Spark materializes work (at the action, against the accumulated logical plan). - *Why* laziness exists: it enables **whole-plan optimization** by Catalyst, not per-step execution. - At least one concrete optimization laziness unlocks (predicate/projection pushdown, partition pruning, operator reordering) tied to a real pipeline. ### Part 2 — Distributed execution How does Spark perform **distributed data processing** across a cluster? Describe the unit of parallelism, the roles of the components involved, and what makes some operations far more expensive than others. ```hint Decomposition Trace one action down the hierarchy: job → stage → task. What determines a *stage boundary*, and what is the smallest unit of data a single task operates on? ``` ```hint The expensive part Classify transformations by whether they require data to move between executors. The boundary that forces a network exchange is the thing you most want to minimize. ``` #### What This Part Should Cover - The **driver / cluster-manager / executor** roles, and that the driver does *not* process bulk data. - The **job → stage → task** hierarchy, with stage boundaries set by shuffles. - How **partitions map to parallelism**: one task processes one partition; task count is the degree of parallelism. - A clear **narrow-vs-wide** transformation framing and why **shuffles** dominate cost (disk write, network transfer, sort, spill). ### Part 3 — Optimizing slow / expensive Spark code How would you optimize slow or expensive Spark/PySpark code? Structure your answer as a **methodology first**, then cover the concrete levers. Touch on **execution plans, shuffles, joins, partitioning, caching, file layout, data skew, Python/UDF overhead, and streaming considerations** where relevant. ```hint Start with measurement, not config Resist jumping to a config flag. What do you read first to localize the bottleneck — scan volume vs. shuffle vs. skew vs. join strategy vs. spill vs. output write? ``` ```hint Joins and skew For joins, the choice is usually broadcast vs. shuffle (sort-merge). What lets you broadcast *safely*, and what techniques rescue you when one key holds a disproportionate share of the rows? ``` ```hint Less data, fewer exchanges Most wins come from moving less data: prune columns and partitions at the source, cut shuffles, and right-size the partition count. Consider what Adaptive Query Execution now automates for you. ``` #### What This Part Should Cover - A **measurement-first methodology** (Spark UI to find the slow stage + `explain` to read the physical plan) rather than a grab-bag of configs. - **Reduce-data-early** levers: column pruning, early filtering, partition pruning, predicate/projection pushdown. - Correct **join** reasoning: when to broadcast vs. sort-merge, and the safe-broadcast condition. - **Data skew** treated as a concrete, fixable problem (AQE skew handling, salting, hot-key isolation), not just named. - Sensible **partition sizing** (`repartition` vs. `coalesce`) and **AQE** awareness. - **File-layout** hygiene (small-file problem, compaction, partitioning/bucketing). - **Caching** as a deliberate, measured choice — not a reflex. - **Python/UDF** serialization cost and the pandas/Arrow (vectorized) UDF alternative. - **Streaming-specific** levers (watermarks, bounded state, trigger tuning, input-vs-processing-rate monitoring). ### What a Strong Answer Covers These cross-cutting dimensions span all three parts and are what separate a strong candidate from one who merely recites facts: - A consistent **throughline**: whether the candidate ties every tuning decision back to a single execution-model principle that connects plan-building, data movement, and cost — rather than treating each Part as an isolated topic. - **First-principles prediction**: reasoning from the execution model to predict where time and money go, rather than reaching for config flags. - **Iterative, evidence-driven thinking**: framing optimization as a loop — locate the dominant cost, address it, re-measure — not a one-shot change. ### Follow-up Questions - A stage runs 199 fast tasks and 1 task that takes 40× longer. Walk me through diagnosing and fixing it. - You broadcast a "small" dimension table and the driver / executors OOM. What happened, and how do you decide the safe broadcast size? - In a structured streaming job, the state store grows unbounded over days. What is the likely cause and the fix? - In the streaming pipeline, a small percentage (~2%) of records fail data-quality validation each batch. Would you stop the pipeline? Describe how you'd handle the failing records without dropping good data. - When would caching a DataFrame *hurt* performance, and how would you confirm it from the Spark UI?

Quick Answer: This question evaluates a candidate's understanding of the Apache Spark execution model, lazy evaluation, distributed processing (including jobs, stages, tasks, and shuffles), and performance-tuning considerations for batch and streaming pipelines on columnar data lakes.

Point72 logo
Point72
Apr 30, 2026, 12:00 AM
Data Engineer
Technical Screen
Software Engineering Fundamentals
9
0

You are interviewing for a Data Engineer role. In a 30-minute discussion on production data engineering, the interviewer probes how well you understand the way Apache Spark / PySpark actually executes work — and how you reason about performance. Answer the following Spark fundamentals questions in a practical, production context: assume you own batch and streaming pipelines that read from a columnar data lake (Parquet/Delta) and write curated tables consumed downstream.

The discussion is structured in three parts that build on one another — the execution model (Parts 1–2) is the lens you use to justify the tuning decisions (Part 3).

Constraints & Assumptions

  • Engine: Apache Spark 3.x (Adaptive Query Execution available), accessed via the PySpark DataFrame API.
  • Scale: multi-terabyte source tables; clusters with tens to hundreds of executor cores; a mix of large fact tables and small dimension tables.
  • Storage: columnar formats (Parquet/ORC/Delta) on object storage, partitioned by date.
  • Goal: reduce job runtime and cost (wall-clock and executor-hours), not just "make it work."

Clarifying Questions to Ask

A strong candidate scopes the problem before tuning. Up front, you would ask:

  • Is the workload batch or streaming (structured streaming with state), and what is the latency / freshness SLA?
  • What is the data layout at the source — file format, partitioning scheme, average file size, and total volume?
  • Is the pain a single slow stage , the whole job end-to-end, or intermittent failures (OOM / disk spill)?
  • What does the cluster look like — executor count, cores and memory per executor, and is dynamic allocation on?
  • Is the job read-heavy, join-heavy, or write-heavy downstream?
  • Are there known skewed keys or hot partitions in the data?

Part 1 — Lazy evaluation

What does it mean that Spark operations are lazy? Distinguish transformations from actions, explain why laziness exists, and give a concrete example of an optimization that laziness enables.

What This Part Should Cover

  • A crisp transformation-vs-action distinction, with correct examples on each side.
  • A correct statement of when Spark materializes work (at the action, against the accumulated logical plan).
  • Why laziness exists: it enables whole-plan optimization by Catalyst, not per-step execution.
  • At least one concrete optimization laziness unlocks (predicate/projection pushdown, partition pruning, operator reordering) tied to a real pipeline.

Part 2 — Distributed execution

How does Spark perform distributed data processing across a cluster? Describe the unit of parallelism, the roles of the components involved, and what makes some operations far more expensive than others.

What This Part Should Cover

  • The driver / cluster-manager / executor roles, and that the driver does not process bulk data.
  • The job → stage → task hierarchy, with stage boundaries set by shuffles.
  • How partitions map to parallelism : one task processes one partition; task count is the degree of parallelism.
  • A clear narrow-vs-wide transformation framing and why shuffles dominate cost (disk write, network transfer, sort, spill).

Part 3 — Optimizing slow / expensive Spark code

How would you optimize slow or expensive Spark/PySpark code? Structure your answer as a methodology first, then cover the concrete levers. Touch on execution plans, shuffles, joins, partitioning, caching, file layout, data skew, Python/UDF overhead, and streaming considerations where relevant.

What This Part Should Cover

  • A measurement-first methodology (Spark UI to find the slow stage + explain to read the physical plan) rather than a grab-bag of configs.
  • Reduce-data-early levers: column pruning, early filtering, partition pruning, predicate/projection pushdown.
  • Correct join reasoning: when to broadcast vs. sort-merge, and the safe-broadcast condition.
  • Data skew treated as a concrete, fixable problem (AQE skew handling, salting, hot-key isolation), not just named.
  • Sensible partition sizing ( repartition vs. coalesce ) and AQE awareness.
  • File-layout hygiene (small-file problem, compaction, partitioning/bucketing).
  • Caching as a deliberate, measured choice — not a reflex.
  • Python/UDF serialization cost and the pandas/Arrow (vectorized) UDF alternative.
  • Streaming-specific levers (watermarks, bounded state, trigger tuning, input-vs-processing-rate monitoring).

What a Strong Answer Covers

These cross-cutting dimensions span all three parts and are what separate a strong candidate from one who merely recites facts:

  • A consistent throughline : whether the candidate ties every tuning decision back to a single execution-model principle that connects plan-building, data movement, and cost — rather than treating each Part as an isolated topic.
  • First-principles prediction : reasoning from the execution model to predict where time and money go, rather than reaching for config flags.
  • Iterative, evidence-driven thinking : framing optimization as a loop — locate the dominant cost, address it, re-measure — not a one-shot change.

Follow-up Questions

  • A stage runs 199 fast tasks and 1 task that takes 40× longer. Walk me through diagnosing and fixing it.
  • You broadcast a "small" dimension table and the driver / executors OOM. What happened, and how do you decide the safe broadcast size?
  • In a structured streaming job, the state store grows unbounded over days. What is the likely cause and the fix?
  • In the streaming pipeline, a small percentage (~2%) of records fail data-quality validation each batch. Would you stop the pipeline? Describe how you'd handle the failing records without dropping good data.
  • When would caching a DataFrame hurt performance, and how would you confirm it from the Spark UI?

Solution

Show

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More Software Engineering Fundamentals•More Point72•More Data Engineer•Point72 Data Engineer•Point72 Software Engineering Fundamentals•Data Engineer Software Engineering Fundamentals
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.