Databricks Software Engineer Interview Prep Guide
Everything Databricks actually asks Software Engineer candidates — concept walkthroughs, worked examples, and the real interview questions, drawn from candidate reports. Free to read.
Last updated

Focus most on system design, concurrency, durable storage, fault tolerance, and the repeated Databricks coding patterns because your self-ratings are 1/5 for system design and 2/5 for coding with no solved-question signals yet. There are no strong self-rated areas to keep truly brief, so graph search, top-k aggregation, and tic-tac-toe/game-tree material stay at normal review depth rather than expansion. The Databricks-specific additions highlight Spark execution fundamentals, Delta Lake transaction/metadata design, and cluster/job scheduling trade-offs. With less than one week left, prioritize the emphasized sections first and use the normal items as fast pass reviews after the core patterns feel interview-ready.
Technical Screen — 69 min
Coding & Algorithms
-
IPv4 CIDR Rule Matching (Focus) — covered in depth under Onsite below.
-
Sliding Window Counters And QPS (Focus) — covered in depth under Onsite below.
-
Snapshotable Collections And Iterators (Focus) — covered in depth under Onsite below.
-
Recursion, Dynamic Programming, And Implicit Structures (Focus) — covered in depth under Onsite below.
System Design
Durable Key-Value Stores And Caches
Focus areaFocus area — System design is 1/5 and you viewed system design most; prioritize WALs, recovery, compaction, indexing, and cache eviction.

What's being tested
Databricks is probing whether you can design storage-backed, concurrent key-value systems with clear APIs, predictable failure behavior, and justified performance tradeoffs. A strong answer covers both the data path (get, put, delete) and the failure path: crash recovery, partial writes, fsync semantics, corruption, and concurrent mutations. Interviewers are not looking for a production clone of `RocksDB`; they want to see whether you can reason from first principles about durability, in-memory indexing, eviction, synchronization, and time-windowed counters. This matters for a Software Engineer at Databricks because many platform components depend on local metadata stores, caches, execution-state stores, and high-QPS services where correctness under concurrency is as important as throughput.
Core knowledge
-
API shape should be explicit before internals:
get(key) -> Optional<Value>,put(key, value),delete(key), optionalcompareAndSwap(key, expectedVersion, newValue),ttl,scan(prefix), andflush. Clarify whether operations are single-key atomic, whether values are bytes or generic typed objects, and whether reads after writes must be immediately visible. -
Durability usually starts with a write-ahead log: append mutation records to disk before applying them to in-memory state. On restart, rebuild the in-memory map by replaying the log in order. The key invariant is: if
put(k,v)is acknowledged, the log record must survive a crash, typically requiringfsyncor group commit. -
Crash recovery requires handling torn writes and partial records. A common record format is
[magic][length][sequence_number][op][key][value][crc32]. During recovery, stop at the first invalid checksum or incomplete length-delimited record. Sequence numbers help resolve duplicate replay and preserve last-writer-wins ordering. -
On-disk layout tradeoffs are central. An append-only log gives fast writes, but unbounded disk growth; a hash index in memory maps keys to file offsets for fast reads. Periodic compaction rewrites only the latest value for each key and drops tombstones older than any active reader.
-
LSM-tree designs, like
`LevelDB`and`RocksDB`, use a mutable memtable, immutable sorted runs, and background compaction. They optimize write throughput and range scans but introduce read amplification and compaction stalls. A simple interview design can mention LSM as an extension, not start there unless range scans or large data volumes require it. -
B-tree designs, like many database indexes, update pages in place and are good for read-heavy workloads and range queries. They require careful page-level crash safety through journaling, copy-on-write, or WAL redo/undo. For an embeddable KV store with simple point lookups, append-only plus compaction is often easier to reason about.
-
Concurrency control must identify shared mutable state: the map, log file offset, LRU list, TTL heap, hit counters, and compaction metadata. A simple design uses a
ReadWriteLock: concurrentgets acquire read locks;put/deleteacquire write locks. Higher-throughput designs use lock striping by key hash plus a separate serialized log append path. -
Race conditions often come from compound operations: “check then insert,” LRU update during
get, TTL expiration while a writer updates the same key, or computing`QPS`while requests are being recorded. The fix is to define the linearization point: forput, it might be after WAL append succeeds and before the map pointer is swapped. -
Persistent cache design combines eviction policy with durability. A typical in-memory cache has
HashMap<Key, Node>plus a doubly linked list for LRU, givingO(1)get,put, and eviction. If persistence is required, evictions and updates must also be logged or the cache may resurrect evicted entries after restart. -
TTL semantics need precision. Store
expiresAt = now + ttlusing a monotonic clock where possible for in-process comparisons, but persist wall-clock timestamps if expiration must survive restart. Lazy expiration checks ongetare simple; proactive expiration uses a min-heap or timing wheel but adds concurrency complexity. -
Sliding-window QPS can be implemented with fixed-size time buckets. For a window seconds and bucket size , keep buckets containing request counts and timestamps. Average QPS is after ignoring stale buckets. Smaller buckets improve accuracy but increase update contention and memory.
-
Serialization and versioning matter for a generic store. Use a pluggable
`Serializer<T>`interface that returns bytes and can fail explicitly. Include a schema/version byte in records so future readers can decode old values. Avoid Java/Python object serialization as a default because it is brittle across code changes and unsafe for untrusted data.
Worked example
For Design a durable key-value store, start by clarifying scope in the first 30 seconds: “Is this single-node or distributed? Are keys and values bounded in size? Do we need range scans, TTL, transactions, or only single-key atomicity? What durability guarantee is expected after put returns?” Then declare a reasonable baseline: single-node embedded store, byte-array keys and values, point reads/writes, last-writer-wins, and durability for acknowledged writes.
Organize the answer around four pillars: API and guarantees, write/read path, recovery and compaction, and concurrency/performance. The write path is: serialize record, append to WAL, fsync according to policy, then update the in-memory hash index from key to latest value location. The read path is: check the in-memory index, seek to the offset if values are only on disk, or return directly if the full value is cached in memory.
For recovery, replay valid WAL records from the beginning or from the latest snapshot, rebuild the hash index, and stop at the first partial/corrupt record using a checksum. For compaction, write a new segment containing only the latest live records, atomically install a manifest pointing to the new segment, then delete old segments after no readers depend on them. A key tradeoff to flag is latency versus durability: calling fsync on every write gives strong guarantees but high `p99`; batching fsync every few milliseconds improves throughput but can lose recent acknowledged writes unless acknowledgments wait for the batch flush. Close by saying that with more time you would add checksummed segment files, background compaction scheduling, metrics for log size and recovery time, and optional compareAndSwap for conditional updates.
A second angle
For Design a single-node persistent in-memory cache, the same storage ideas apply, but the priority shifts from durable database semantics to bounded memory and fast access. The core structure is HashMap<Key, Entry> plus an LRU list or segmented LRU policy, with a WAL recording put, delete, and eviction events so restart reconstructs the same logical cache. The tricky part is that get mutates recency state, so a supposedly read-only operation can contend on the LRU lock. A strong design might batch recency updates, use sharded LRUs, or accept approximate LRU to reduce contention. Unlike the durable store, it is acceptable to discuss weaker durability if the cache can be rebuilt, but you must state that assumption explicitly.
Common pitfalls
Pitfall: Treating “durable” as “write to a file.”
Writing bytes to a file is not enough; data can sit in OS page cache, records can be partially written, and directory metadata may not be durable after rename. A better answer distinguishes write, flush, fsync, atomic rename, checksums, and recovery behavior after crashes at different points.
Pitfall: Ignoring the linearization point under concurrency.
A tempting but weak answer says “use a mutex” without explaining what operation becomes atomic. Interviewers may push with two threads doing put(k,1) and get(k) while a WAL append is in progress; you should define whether visibility happens before or after the log is durable and ensure the map and log cannot disagree for acknowledged operations.
Pitfall: Overdesigning into a distributed database.
Do not jump to sharding, consensus, quorum reads, or `Raft` unless the prompt asks for multiple nodes. For these interviews, a precise single-node design with WAL, compaction, locking, and recovery is usually stronger than a vague distributed architecture. If you mention distributed extensions, keep them as optional follow-ups.
Connections
Interviewers may pivot from here into thread-safe data structures, LSM-tree storage engines, cache eviction algorithms, rate counters, or idempotent APIs. They may also ask how this differs from using `Redis`, `RocksDB`, `SQLite`, or an in-process `ConcurrentHashMap`, so be ready to compare guarantees rather than just features.
Further reading
-
Designing Data-Intensive Applications — Martin Kleppmann’s chapters on storage, replication, and consistency give strong mental models for logs, indexes, and crash recovery.
-
The Log-Structured Merge-Tree — the foundational paper behind many write-optimized KV stores such as
`LevelDB`and`RocksDB`. -
RocksDB Wiki — practical details on WALs, memtables, compaction, write amplification, and performance tuning in a real embedded storage engine.
Practice questions
Concurrency Control And Thread Safety
Focus areaFocus area — Low system-design confidence makes race analysis, locks, condition variables, snapshots, and synchronization costs essential Databricks preparation.

What's being tested
Interviewers are probing whether you can reason about shared mutable state under concurrent access without hand-waving away correctness. For Databricks, this matters because storage engines, metadata services, caching layers, and distributed execution components routinely serve many readers and writers while preserving durability, isolation, and predictable latency. A strong answer separates safety properties, like no lost updates or corrupted invariants, from liveness properties, like no deadlock, starvation, or unbounded blocking. The interviewer is usually looking for clear synchronization boundaries, explicit invariants, failure-mode analysis, and tradeoffs between coarse locking, fine-grained locking, lock-free techniques, and versioned designs.
Core knowledge
-
Thread safety means every public operation preserves the data structure’s invariants under arbitrary valid interleavings. Define the invariant first: queue size is always
0 <= size <= capacity, a key-value store index always points to a durable record, or a snapshot iterator observes a stable logical version. -
Race conditions happen when correctness depends on timing. The classic read-modify-write bug is
x = x + 1: two threads can read the same old value and lose one increment. Fixes include mutexes, atomic compare-and-swap, sharding counters, or redesigning the operation to be idempotent. -
Linearizability is the usual correctness target for concurrent in-memory APIs: every operation appears to take effect atomically at some instant between call and return. For example,
put(k, v)in a key-value store should have a clear linearization point, often the lock-protected index update or successfulCAS. -
Condition variables solve blocking coordination, not mutual exclusion by themselves. A bounded queue typically uses one mutex plus
notEmptyandnotFull; calls towait()must occur in awhileloop because of spurious wakeups and because another thread may consume the condition first. -
Backpressure is a correctness and stability tool. A bounded queue should define behavior for
enqueue: block forever, block until timeout, returnfalse, or throw. Capacity prevents unbounded memory growth, but poor wakeup policy can cause convoying or starvation under high producer/consumer contention. -
Lock granularity controls contention. A single global lock is simple and often sufficient up to moderate concurrency; striped locking hashes keys across, say, 64 or 256 locks to improve throughput. Fine-grained locks reduce contention but increase deadlock risk, complexity, and cache-coherence overhead.
-
Deadlock prevention requires a consistent lock order or avoiding nested locks. If a range cache locks chunk metadata and then an in-flight request map, every code path must acquire them in the same order. Timeouts detect symptoms but do not prove safety.
-
Read-write locks help only when reads dominate and read sections are nontrivial. They can hurt when writes are frequent, readers are short, or implementation favors readers and starves writers. For highly read-heavy maps, copy-on-write or RCU-style versioned snapshots may be cleaner.
-
MVCC and snapshot isolation let readers avoid blocking writers by reading immutable versions. A snapshotable set can store insert/delete version numbers per element, so an iterator at version
vreturns elements wherecreated <= v < deleted. This trades memory and garbage collection complexity for stable iteration. -
Write-ahead logging is the foundation of durability in a key-value store. The rule is: append the intent or record to the WAL and
fsyncas required before exposing the update in the in-memory index. Recovery replays committed log records; torn writes need checksums, lengths, and record boundaries. -
Atomicity across memory and disk is subtle. If
put(k, v)updates the in-memory index before the log is durable, a crash can expose acknowledged data that cannot be recovered. If the log is durable before the index update, recovery is safe because replay can rebuild the index. -
In-flight request deduplication is a concurrency-control pattern for caches. For a range-aware file cache, maintain a map like
(fileId, chunkId) -> Future; the first caller fetches the chunk, and later callers await the same future. On failure, remove the future so retries are possible.
Worked example
For Design a thread-safe bounded queue, a strong candidate starts by clarifying the API: enqueue(item, timeout), dequeue(timeout), size(), shutdown semantics, whether fairness is required, and whether null items are allowed. They would state assumptions early: fixed capacity, multiple producers and consumers, blocking operations with timeout, and linearizable behavior for successful operations. The answer can be organized around four pillars: internal state, synchronization strategy, blocking semantics, and edge cases.
The internal state is a circular buffer with head, tail, and count, where the invariant is 0 <= count <= capacity. The synchronization strategy is one mutex protecting all three fields, plus two condition variables: notFull for producers and notEmpty for consumers. enqueue waits in a while count == capacity loop, computes remaining timeout using a monotonic clock, inserts at tail, increments count, and signals notEmpty; dequeue mirrors this and signals notFull. The candidate should explicitly flag the tradeoff between signal() and broadcast(): signal() avoids thundering-herd wakeups, while broadcast() can be useful for shutdown or complex predicates.
A good answer also covers size(): either acquire the same mutex for exact linearizable size, or expose approximate size via an atomic counter if the API allows it. Fairness should be discussed separately from correctness; FIFO item order does not imply FIFO thread scheduling. To close, say that with more time you would add tests using randomized producer/consumer schedules, timeout boundary cases, interruption/shutdown behavior, and stress tests under tools like ThreadSanitizer or jcstress.
A second angle
For Design a durable key-value store, the same concept appears at the boundary between concurrent operations and crash recovery. Instead of only protecting an in-memory invariant like queue size, you must preserve a cross-layer invariant: acknowledged writes must survive process or machine crashes. A simple design uses a global write mutex around append WAL -> fsync policy -> update memtable/index, while concurrent readers use a read lock or immutable memtable snapshot. The key framing difference is that “thread-safe” is not enough: an update that is race-free in memory can still be incorrect if the crash happens after index mutation but before durable log persistence. The interviewer may push you toward higher throughput, where you discuss group commit, lock striping by key, immutable segments, and compaction coordination.
Common pitfalls
Pitfall: Treating
volatileor atomics as a universal replacement for locks.
Atomic visibility does not protect compound invariants like head, tail, and count moving together. A better answer says exactly which operations need mutual exclusion, which can use atomics safely, and where the linearization point is.
Pitfall: Describing a lock but not the waiting protocol.
For blocking structures, “use a mutex” is incomplete. You need to specify condition predicates, while-based waits, timeout recomputation, wakeup signaling, and what happens on shutdown, interruption, or cancellation.
Pitfall: Optimizing before proving correctness.
Jumping straight to lock-free queues, fine-grained range locks, or custom MVCC can sound sophisticated but often hides missing invariants. Start with a correct coarse-grained design, name its bottleneck, then evolve to striped locks, immutable snapshots, or in-flight future deduplication only when the workload justifies it.
Connections
Interviewers can pivot from this topic into storage engine design, especially WALs, memtables, compaction, and crash recovery. They may also move toward distributed concurrency control, including leases, optimistic concurrency, idempotency keys, and consensus-backed metadata updates. For coding-heavy follow-ups, expect implementation details around ReentrantLock, synchronized, std::mutex, Condition, Semaphore, or atomic CAS loops.
Further reading
-
The Art of Multiprocessor Programming by Herlihy and Shavit — rigorous treatment of linearizability, locks, nonblocking algorithms, and concurrent data structures.
-
Designing Data-Intensive Applications by Martin Kleppmann — excellent chapters on storage, transactions, isolation, logs, and distributed consistency tradeoffs.
-
SQLite Write-Ahead Logging documentation — practical example of WAL design, concurrent readers/writers, checkpoints, and durability tradeoffs.
Practice questions
Fault-Tolerant Backend System Design
Focus areaFocus area — Your 1/5 system rating makes retries, idempotency, persistence, recovery, observability, and consistency trade-offs a top priority.
What's being tested
Interviewers are probing your ability to design fault-tolerant, durable backend systems that maintain correctness under crashes, concurrency, and partial failures while meeting performance constraints. Expect to justify on-disk layouts, recovery paths, concurrency control, and tradeoffs between latency, throughput, and durability. Databricks values engineers who can make pragmatic choices (e.g., between simpler, auditable designs and complex but faster designs) and clearly communicate failure modes and mitigations.
Core knowledge
-
Durability: durable means committed state survives process and machine crashes; the common primitive is the write-ahead log (WAL) plus periodic checkpoints;
fsyncsemantics and mount/OS orderings matter for guarantees. -
WAL vs LSM-tree vs B-tree: WAL provides durable append-only logging; LSM-tree (used by
LevelDB/RocksDB) provides high write throughput via memtables and SSTables; B-tree (used bySQLite,Postgres) favors lower read amplification and in-place updates. -
Compaction & write amplification: compaction in LSMs trades CPU and I/O for read performance; write amplification ≈ total bytes written / user bytes written; typical LSM amplification can be 2–10× depending on levels and compaction strategy.
-
Crash recovery: replay the WAL up to last committed entry, then apply a checkpoint or snapshot; ensure idempotent replays or use sequence numbers to avoid double-application.
-
Atomicity for single-node systems: use atomic file rename or two-phase write with checksum + atomic swap for single-key atomic replacement; for multi-key transactions, make explicit isolation guarantees (e.g., snapshot isolation via MVCC).
-
Concurrency control: for in-process systems, low-latency designs use lock-free reads with copy-on-write memtables; otherwise use fine-grained locks, reader–writer locks, or MVCC to reduce read–write contention.
-
Range reads and caching: support range scans by organizing SSTables or block indexes to map key ranges to file offsets; use block-level caching plus request coalescing to deduplicate concurrent range fetches.
-
Failure modes at boundaries: durable persistence is only as strong as your metadata; store manifest/manifest checksums atomically and validate on startup to detect partial compactions or corrupted SSTables.
-
Eviction, persistence tradeoff: a single-node persistent in-memory cache should checkpoint periodically to disk; tune checkpoint interval against throughput and recovery time objective (RTO).
-
External integration patterns: for aggregators calling external price APIs, use idempotency keys, bounded retries, exponential backoff, and request coalescing to avoid thundering herds and inconsistent caches.
-
Observability & testing: surface
p99, throughput, recovery time, and corruption counts; test using crash-failure injection (kill during compaction, power-failfsyncomitted) and validate invariants with checksums. -
Sizing rules of thumb: memtable default sizes ~64–512MB; SSTable target file sizes ~64–512MB; these work up to hundreds of GB per node; beyond that, favor sharding/partitioning.
Worked example — Design a durable key-value store
First 30s: ask workload (read-heavy vs write-heavy), expected dataset size, durability level (sync-per-write vs async), single-process vs distributed, and whether range scans are required. Skeleton: (1) on-write path: append to WAL then apply to in-memory memtable; (2) background compaction: flush memtable → immutable SSTable and merge levels; (3) read path: check memtable(s) then SSTable indexes with block cache; (4) recovery: replay WALs and load manifest, validate checksums. Key tradeoff: choose sync-per-write fsync for highest durability but high latency, or group commits (batching) for throughput at cost of small data-loss window. Explicit decision: use LSM design (RocksDB-style) if writes dominate; choose B-tree-like approach for small dataset and range latency. Close: state how you'd iterate — add background checksumming, tune compaction strategy, expose metrics, and design a simple replication protocol if multi-node durability is required.
A second angle — Design concurrent range-aware file caching client
Same primitives apply but constraints differ: this client must deduplicate in-flight range requests, persist cached chunks, and validate freshness. Use a chunked on-disk layout and an in-memory index mapping ranges to chunk metadata, protect concurrent access with per-file range locks or lock-free interval trees for low contention. Implement request coalescing so simultaneous requests for overlapping ranges wait on one upstream fetch; use ETags or checksums for validation and conditional GETs to avoid stale data. Tune prefetch and eviction by access patterns: sequential reads favor larger prefetch windows, random reads favor small chunks and LRU. The durability question narrows to atomic writes of cached chunks (use temp-file + atomic rename) and a compact manifest to recover cache state after crashes.
Common pitfalls
Pitfall: Assuming
fsyncis instantaneous — many candidates omit OS and storage latencies; state realistic latencies and batch writes or expose sync-mode configuration to users.
Pitfall: Designing compaction without accounting for write amplification — proposing aggressive merges without quantifying I/O cost will lead to unacceptable backend load.
Pitfall: Overgeneralizing distributed solutions — avoid proposing full consensus (e.g.,
Raft) when the prompt is single-node; instead state why you'd add replication later and how you'd keep the single-node design simple first.
Connections
Interviewers may pivot to distributed consensus (e.g., Raft) when durability needs span nodes, or to storage engine internals like bloom filters and block indexes for read amplification optimizations. They may also ask about observability (SLOs, p99, recovery time) or efficient on-disk formats for analytics workloads.
Further reading
-
The Log-Structured Merge-Tree (LSM-tree) paper — canonical design explaining memtable/SSTable/compaction tradeoffs.
-
Raft: In Search of an Understandable Consensus Algorithm — why and how to add replication/consensus if you extend to multi-node durability.
-
RocksDBdesign notes — pragmatic engineering choices and tuning knobs for single-node high-throughput stores.
Practice questions
Focus area — Databricks interviews can reward Spark fluency; learn partitions, shuffles, lazy evaluation, stages, joins, and execution-plan trade-offs.
What's being tested
Interviewers are checking that you understand how Spark executes DataFrame operations end-to-end: the logical-to-physical planning, lazy evaluation, how transformations become stages/tasks, and the costs that cause slow jobs (notably shuffles and skew). For a Software Engineer, the focus is on reasoning about runtime behavior, choosing the right join/partitioning strategy, and making safe code-level tradeoffs (memory, serialization, and API choices) rather than operating-cluster or product metrics. Expect to explain cause → symptom → fix with concrete knobs and complexity estimates.
Core knowledge
-
Lazy evaluation:
DataFrametransformations are recorded as a logical plan; an action triggers planning, optimization, and execution. This enables whole-stage optimizations and plan-based reordering. -
Catalyst optimizer: rewrites logical plans (projection/predicate pushdown, constant folding); it produces a physical plan selected by cost rules — understand what it will and won't optimize automatically.
-
Tungsten execution: the physical execution focuses on efficient memory use and CPU (e.g., binary
UnsafeRow, off-heap memory, vectorized processing) to reduce GC and improve throughput. -
Narrow vs wide transformations: narrow (e.g.,
map,filter) keep data in the same partition; wide (e.g.,groupBy,join,reduceByKey) require shuffle across executors and create stage boundaries. -
Shuffle cost model: shuffles incur disk/network/serialization I/O; cost ≈ O(data_size) plus overhead per task. Number of output tasks = number of partitions. Tune
spark.sql.shuffle.partitionsto match cluster cores and input size. -
Join strategies:
BroadcastHashJoin(works when one side <spark.sql.autoBroadcastJoinThreshold, default 10MB),ShuffleHashJoinfor medium,SortMergeJoinfor large; broadcast avoids shuffle,SortMergerequires shuffle + sort. -
Adaptive Query Execution (AQE): (Spark 3+) can change join types and partitioning at runtime based on actual stats, mitigating misestimates and skew; know when AQE helps vs when to disable it.
-
Partitioning & skew: good key distribution prevents hot tasks; use
repartition,coalesceor salting to break keys when skewed. Target partition count ≈ 2–4× total cores to fill and parallelize work without too many tiny tasks. -
Caching and persistence:
persist()/cache()materialize RDD/DataFrame in memory (or disk); beneficial only when reused multiple times and when memory fits — otherwise cause eviction and GC pressure. -
Serialization & encoders:
DatasetusesEncoderand can avoid Java serialization viaUnsafeRow; Python (pyspark) crosses the JVM-Python boundary and pays serialization cost per partition/row when using UDFs. -
UDFs vs native expressions: plain SQL/DataFrame operations benefit from Catalyst & whole-stage codegen; UDFs (Python or non-optimized JVM) break optimizations and can disable vectorization/whole-stage codegen.
-
Fault boundaries: stages are the retry unit; long-running tasks due to skew or excessive per-task memory make retries expensive; consider checkpointing only for very long lineage chains to bound DAG size.
Worked example
(Representative interview prompt: "Explain how to optimize a join between a large and a small DataFrame.") Frame the problem: ask data sizes, cardinality of join key, whether small side fits memory, and whether keys are skewed; declare assumptions (e.g., small side ≈ 20MB, join key cardinality moderate). Organize your answer around three pillars: (1) choose join strategy (BroadcastHashJoin if small side < spark.sql.autoBroadcastJoinThreshold), (2) reduce data before join (projection, filter, select), (3) partitioning and skew handling (ensure small-side broadcast, or pre-repartition by join key if not broadcastable). Explicit tradeoff: broadcasting saves shuffle but increases driver/executor memory pressure and network broadcast cost — if small side is borderline (e.g., 50–200MB), consider increasing threshold carefully or use broadcast hint with monitoring. Close by listing actionable checks: inspect explain() physical plan, run with AQE enabled, profile task durations, and if more time would add micro-benchmarks for different thresholds and test with realistic data.
A second angle
(Representative different prompt: "A job running groupBy on a Parquet table is very slow — what do you inspect and change?") Use the same execution concepts but shift focus: check predicate pushdown and column pruning are applied to reduce read I/O, confirm Parquet columnar layout helps vectorized reads, and verify file sizes (many tiny Parquet files can explode task count and overhead). If grouping creates heavy shuffle, reduce shuffle amount by pre-aggregating (mapPartitions with local aggregation), adjust spark.sql.shuffle.partitions, and consider using approx_count_distinct or streaming/incremental aggregation if exact full aggregation is unnecessary. The transfer principle: diagnose where a wide transformation creates shuffle and then minimize data shuffled or change the operator.
Common pitfalls
Pitfall: Blaming Spark without checking the physical plan — many fixes come from reading
df.explain(true)and identifying shuffle boundaries, broadcast hints, and missing predicate pushdown.
Pitfall: Over-broadcasting a borderline dataset — forcing
broadcast()or increasingspark.sql.autoBroadcastJoinThresholdcan OOM executors or trigger large network broadcasts; quantify memory cost before forcing.
Pitfall: Using Python UDFs for simple transformations — they prevent Catalyst optimizations and incur JVM-Python serialization; prefer built-in SQL/DataFrame functions or Scala/Java UDFs with care.
Connections
Interviewers frequently pivot to streaming (how stateful aggregations and checkpointing change execution guarantees) or to lower-level resource allocation (executor sizing, cores-per-task tradeoffs). They may also ask about the contrast between RDD and DataFrame/Dataset APIs when control over serialization or partition-local processing is required.
Further reading
-
Spark SQL, DataFrames and Datasets Guide — Apache Spark docs — concise reference for join strategies and configuration knobs.
-
Spark: The Definitive Guide (Bill Chambers & Matei Zaharia) — practical coverage of execution model, tuning, and examples.
-
Databricks blog: Adaptive Query Execution in Spark 3.0 — explains AQE tradeoffs and behavior.
Practice questions
Focus area — This is Databricks-specific and not duplicated by the base outline; focus on transaction logs, snapshots, schema evolution, and compaction.
What's being tested
Interviewers are probing your grasp of how Delta Lake implements ACID transactions and manages table metadata at scale: the commit protocol, snapshot construction, conflict detection, and performance tradeoffs. For a Software Engineer, this tests distributed-systems thinking — correct atomicity/consistency semantics, scalability of metadata operations, and safe design choices under object-store constraints. You'll be expected to explain mechanisms, tradeoffs, and how to diagnose/mitigate metadata-related bottlenecks.
Core knowledge
-
Delta Lake (
Delta Lake) stores table state as an append-only transaction log (_delta_log/) composed of ordered JSON commit files and periodic Parquet checkpoints; readers build snapshots by reading the latest checkpoint then applying later JSON commits. -
Transaction actions include
AddFile,RemoveFile,MetaData,Protocol, andCommitInfo; the commit file is the unit of atomic change and records the exact set of file-level diffs for a version. -
Atomic commit/visibility is achieved by creating a single new transaction file for the version; readers treat the highest-numbered complete commit file as the new snapshot — no two partial commits are visible.
-
Optimistic concurrency control (OCC): writers read a snapshot, compute file-level adds/removes, and succeed unless the set of affected files changed by a concurrent commit; conflicts are detected at commit time, causing retries or application-level conflict resolution.
-
Snapshot isolation semantics: readers see a consistent snapshot corresponding to a commit version; Delta provides snapshot isolation (not full serializability) — concurrent writes that touch the same files can conflict and be rejected.
-
Checkpointing cost model: reading cost ≈ cost(checkpoint read of M files) + cost(apply N commits since checkpoint). If there are many small commits (large N) scans & startup become expensive — increase checkpoint frequency or compact commits.
-
Metadata growth & mitigation: many small files or high commit rates blow up
_delta_log; mitigate by periodic checkpoints, file compaction (e.g.,OPTIMIZE), batching writes, or using write-side coalescing to reduce churn. -
Schema enforcement vs evolution: schema enforcement rejects writes that violate the current schema; schema evolution is a logged
MetaDatachange and should be gated (e.g.,mergeSchemaflags) to avoid accidental incompatible upgrades. -
Garbage collection & time travel:
RemoveFileentries tombstone data files;VACUUMphysically deletes files older than retention threshold, trading off storage vs availability of historical versions andtime travel. -
Protocol/versioning: Delta uses a protocol header (
Protocolaction) to gate features — writers/readers must negotiate minimum protocol versions to enable new guarantees or actions. -
Object-store semantics matter: atomic rename semantics differ across
HDFSvsS3/GCS; Delta relies on ordering of commit filenames and eventual consistency behavior; system design must handle retries and listing inconsistencies. -
Streaming & idempotency: when using Delta as a streaming sink, commit metadata (e.g.,
CommitInfo) and idempotency keys help achieve exactly-once semantics over retries; commit atomicity plus idempotent writes are core.
Tip: quantify costs with N = commits since last checkpoint and M = files in checkpoint; aim to keep N small (tens–hundreds) for low-latency snapshot construction in latency-sensitive systems.
Worked example — "Explain how Delta Lake implements ACID transactions and snapshot isolation"
First 30s: clarify expected concurrency (single writer vs many), storage backend (HDFS/S3), and whether time-travel/streaming are required. State assumptions: object store provides atomic object creation and consistent listing eventually.
Skeleton of answer:
-
Describe the transaction log model: sequence of commit files (
_delta_log/) plus checkpoints to represent table versions. -
Explain a writer’s commit flow: read snapshot → compute file-level diffs → write new commit JSON → publish atomically (unique versioned filename).
-
Explain conflict detection: optimistic concurrency checks that files a writer planned to remove still exist; if not, commit fails and must retry.
-
Explain reader semantics: reconstruct snapshot by reading latest checkpoint + applying later commits → yields snapshot isolation semantics.
-
Close with metadata scaling: emphasize checkpointing frequency, compaction, vacuums, and protocol version.
One tradeoff to flag: aggressive checkpointing reduces read startup latency (small N) but costs CPU/I/O and increases write latency during checkpoint creation; choose frequency based on commit rate and read-latency SLOs.
If I had more time, I'd sketch retry/backoff strategies, a metrics/alerting plan for slow snapshot construction, and how to instrument commit conflicts.
A second angle — "How would you scale Delta metadata for a high-write-rate table with many small files?"
Reframe: here the core concept (transaction-log-based metadata) stays the same but constraints shift to operational scalability. Answer in 4–6 sentences: focus on reducing commit churn and limiting how far reads must replay log entries. Propose batching small writes server-side or client-side coalescing into fewer AddFile actions, increasing checkpoint frequency to cap N, and periodic compaction (OPTIMIZE) to reduce file counts M. Discuss tradeoffs: more compaction increases CPU and may block writers unless done carefully (background compaction, incremental rewriting). Also consider write-path changes: using a write-master/coordinator to produce larger committed units or using micro-batch streaming with larger micro-batches to lower commit rate. Don't forget VACUUM retention implications on time travel: aggressive cleanup reduces storage but aborts older time-travel queries.
Common pitfalls
Pitfall: Equating snapshot isolation with serializability.
Many candidates say Delta provides full serializability; correct answer is it provides snapshot isolation via optimistic concurrency — concurrent transactions can be aborted on conflicting file changes, but some write-write anomalies remain unless additional coordination is added.
Pitfall: Ignoring checkpoint and log replay costs.
A tempting but wrong solution is to assume metadata operations are constant-time; failing to reason about N (commits since checkpoint) leads to designs that hit long read startup times and OOMs when_delta_logexplodes.
Pitfall: Overlooking object-store semantics.
Assuming POSIX rename is atomic leads to brittle designs onS3/GCS. Better to describe how commit filenames, unique versioning, and idempotent publishing mitigate eventual-consistency/listing edge cases.
Connections
Delta’s transaction log and metadata discussion naturally pivots to Change Data Feed (CDC) / Change-streams, streaming exactly-once sinks, and table layout topics such as partitioning and file compaction. Interviewers may also pivot to object-store consistency models and how they affect distributed commit protocols.
Further reading (optional)
- Delta Lake protocol docs — canonical spec of actions, checkpoints and protocol versioning.
Practice questions
Focus area — Databricks systems depend on efficient shared compute; review scheduling, retries, isolation, autoscaling, fairness, and noisy-neighbor trade-offs.
What's being tested
Interviewers probe your ability to design and reason about a multi-tenant cluster scheduler that balances throughput, job latency, and tenant isolation. They expect familiarity with real scheduler primitives (admission control, fairness, preemption, gang scheduling), practical isolation mechanisms (container limits, `cgroups`), and the tradeoffs between utilization and predictability. Databricks cares because efficient, isolation-safe scheduling directly affects job completion times, cluster utilization, and customer SLAs for analytics and ML workloads.
Core knowledge
-
Scheduler objectives: trade off throughput (jobs/sec), mean/median job completion time, tail latency (
p99), utilization, and fairness; optimize using weighted combinations or SLO-driven admission control. -
Scheduling algorithms: offline bin-packing is NP-hard; use heuristics like First-Fit Decreasing, Best-Fit, and variant multi-resource bin-packing; quantify cost: heuristics are O(n log n) or O(n·m).
-
Multi-resource fairness: Dominant Resource Fairness (DRF) generalizes fairness across CPU/memory/GPU; each tenant’s allocation is measured by its dominant share and used to equalize shares.
-
Gang scheduling and co-allocation: for parallel jobs (e.g.,
Sparkexecutors), schedule whole groups atomically to avoid head-of-line blocking; implement via reservation or hold-and-backfill strategies to avoid fragmentation. -
Preemption & checkpoints: preemption reduces tail latency but wastes work; pair with checkpointing or application-level retries; quantify break-even: preemption cost < waiting-time improvement.
-
Backfilling & headroom: backfilling increases utilization by running small jobs in gaps; requires accurate runtime estimates and strict admission control to prevent starving larger jobs.
-
Admission control & autoscaling: enforce capacity by rejecting or queuing additional jobs; autoscaling adds nodes when sustained headroom < threshold; avoid thrash via cooldown and hysteresis.
-
Resource isolation primitives: use
`cgroups`(v1/v2) for CPU shares/sets, memory limits and OOM handling, and Linux namespaces for network/IPC isolation; CPU pinning reduces jitter but increases fragmentation. -
OOM and memory accounting: memory limits may trigger OOM killer; prefer hard limits for tenants requiring isolation, soft limits plus swap for best-effort workloads; monitor RSS vs. cached pages.
-
Network and disk QoS: isolate noisy neighbors via token buckets,
tcfor bandwidth shaping, and per-tenant IOPS limits (kernel blkio or storage-level QoS). -
Scheduling metrics & formulas: utilization = busy_time / total_time; job slowdown = (wait_time + run_time) / run_time; use exponential moving averages for runtime predictions.
-
State & failure modes: scheduler must handle node flakiness (eviction, transient disconnects) and adopt conservative allocations (replica placement, avoiding correlated failures, respecting rack-awareness).
Worked example — "Design a multi-tenant scheduler that minimizes average job completion time while ensuring tenant isolation"
First 30s: ask about workload mix (short vs long jobs, batch vs streaming), resource types (CPU/GPU/memory/disk/net), SLA per tenant, and whether jobs support checkpointing. Declare assumptions: heterogeneous nodes, preemption allowed, and accurate runtime estimates within ±30%.
Skeleton of answer:
-
Define objectives and SLOs: weighted average completion time + tenant isolation constraints (max dominant share).
-
Admission control layer: per-tenant queue with rate/weight and
`DRF`-based capacity caps; reject/queue when tenant exceeds share. -
Core scheduler: combine best-fit decreasing for bin-packing with backfilling for small jobs; apply gang scheduling for parallel jobs.
-
Isolation enforcement: use
`cgroups`for CPU/memory limits and network shaping for noisy neighbors; implement soft limits for best-effort workloads. -
Preemption and checkpointing: preempt only low-priority or best-effort jobs; require checkpointing for preemptible long jobs to bound wasted work.
Tradeoff to flag: aggressive preemption reduces average wait time but increases wasted compute and may reduce utilization—quantify threshold when preemption yields net benefit. Close: if more time, add runtime prediction model, adaptive backoff between backfilling and strict capacity, and multi-cluster placement optimization considering data locality.
A second angle — "How to support bursty ML training jobs with large GPU/Memory requests"
Same primitives apply but constraints shift: GPUs are scarce and hard to preempt. Prioritize gang scheduling and reservation windows to co-allocate all GPUs, use bin-packing across node-level GPU topology (NUMA and PCIe), and prefer scheduling policies that reserve headroom for expected bursts. Make checkpointing mandatory for long GPU jobs; where checkpointing is expensive, prefer pre-reservation and admission control rather than preemption. For throughput, consider opportunistic packing of small experiments into idle GPUs using container-level isolation and cgroup GPU accounting (or device plugin frameworks in `Kubernetes`).
Common pitfalls
Pitfall: Over-optimizing for utilization. Many candidates maximize utilization and neglect tail latency or SLA guarantees; explicitly show how utilization gains would impact tenant SLOs and why a balanced objective is needed.
Pitfall: Treating resources as fungible. Saying "just pack CPU and memory" ignores multi-resource conflicts—explain DRF or dominant-share metrics to avoid starvation when jobs are bottlenecked by different resources.
Pitfall: Ignoring preemption costs. Proposing aggressive preemption without quantifying wasted work or checkpoint overhead looks naive; estimate checkpoint frequency, time-to-recover, and net benefit.
Connections
Interviewers may pivot to autoscaling and cluster elasticity, data locality & network-aware placement, or scheduler implementation details (leader election, persistence, and high-availability). They might also ask about runtime prediction models for job durations or how to instrument scheduler metrics (e.g., wait distributions, fairness indices).
Further reading
-
Dominant Resource Fairness paper (Ghodsi et al., NSDI 2011) — formalizes fair allocation across multiple resource types.
-
Google Borg: Cluster Management at Scale — practical lessons on scheduling, preemption, and multi-tenancy from a production scheduler.
-
`Kubernetes`Scheduler Documentation — concrete implementation details, plugins, and evictions for containerized environments.
Practice questions
Onsite — 33 min
Coding & Algorithms
IPv4 CIDR Rule Matching
Focus areaFocus area — Your 2/5 coding rating and zero solved signals make IP parsing, masks, ordering, and overlap cases a high-yield screen topic.

What's being tested
IPv4 CIDR rule matching tests bit-level reasoning, parsing, and choosing the right lookup structure for prefix/range containment. Interviewers probe whether you can implement correctness first, then scale from linear scans to prefix tries, sorted ranges, or bucketed lookups with clear rule-priority semantics.
Patterns & templates
-
IPv4 parsing via
`ip_to_int(s)`— split four octets, validate0..255, compute(a<<24)|(b<<16)|(c<<8)|d; use unsigned 32-bit logic. -
CIDR containment with masks — for
a.b.c.d/p,mask = (0xffffffff << (32-p)) & 0xffffffff; match when(ip & mask) == (base & mask). -
Range conversion — CIDR block maps to
[start, end], wherestart = base & mask,end = start | (~mask & 0xffffffff); useful for interval search. -
Linear scan baseline —
O(R)per query,O(R)space; acceptable if rule count is small or “first matching rule wins” dominates. -
Binary trie for longest-prefix match — insert 32 bits, store rule/action at nodes; query in
O(32)time, spaceO(total prefix bits). -
Priority handling — distinguish first rule wins, last rule wins, and longest prefix wins; store insertion index or best-so-far metadata explicitly.
-
Dynamic updates — trie insert/delete is
O(32); sorted interval structures need rebalancing and careful overlap handling.
Common pitfalls
Pitfall: Treating IPs as strings causes wrong ordering and containment; always normalize to a 32-bit integer before comparison.
Pitfall: Mishandling
/0and/32;/0matches everything, while/32matches exactly one address.
Pitfall: Optimizing before clarifying semantics; “first CIDR block covering IP” and “longest-prefix firewall rule” require different lookup behavior.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions
Sliding Window Counters And QPS
Focus areaFocus area — This appears in both rounds and connects to backend metrics; emphasize queues, buckets, cleanup, and time/space trade-offs.

What's being tested
These problems test time-windowed aggregation: maintaining counts, rates, or averages over the last seconds without scanning all historical events. Interviewers look for clean data structure tradeoffs, correct expiry logic, and complexity analysis under monotonic timestamps, timestamp collisions, and high event volume.
Patterns & templates
-
Circular bucket array — store
(timestamp, count)per second;hit(t)andget(t)areO(1)/O(W), spaceO(W). -
Lazy bucket reset — when
t % Wis reused, reset bucket if stored timestamp differs; prevents stale counts from leaking. -
Deque of timestamps/events — append on hit, pop expired while
front <= now - W; amortizedO(1), space proportional to recent hits. -
Aggregated deque buckets — store
(bucketStart, count)for sparse streams or range queries; merge same bucket, evict old buckets. -
Running total optimization — maintain
totalalongside buckets/deque sogetCount()isO(1)after evicting expired entries. -
QPS formula — average QPS is
events_in_window / window_seconds; clarify whether denominator is fixedWor elapsed warm-up time. -
Per-key counters — use
Map<Key, Counter>for KV-store variants; evict inactive keys if memory bounds matter.
Common pitfalls
Pitfall: Forgetting timestamp collisions in modulo buckets;
t % Walone is not enough without storing the bucket’s real timestamp.
Pitfall: Off-by-one expiry errors; define whether the valid interval is
(now - W, now]or[now - W, now].
Pitfall: Claiming
O(1)queries while summing allWbuckets each time; either admitO(W)or maintain a running total.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions
RLE And Bit-Packing Compression
Focus areaFocus area — Onsite-only but Databricks-relevant; emphasize streaming codec state, bit boundaries, integer representation, and lossless correctness.

What's being tested
Tests lossless compression implementation with careful state management, bit manipulation, and edge-case handling. Strong answers show clean encoder/decoder symmetry, correct handling of 32-bit signed integers, and streaming-safe APIs that do not require loading all input into memory.
Patterns & templates
-
Run-length encoding groups consecutive equal values as
(value, count);encode_rleanddecode_rleareO(n)time with small state. -
Streaming RLE keeps
current_value,run_count, andpending_output; flush on value change, count limit, or end-of-stream. -
Bit packing stores fixed-width integers using
bit_buffer,bits_in_buffer, and masks like(1 << width) - 1; encode/decode areO(n). -
Signed integer handling requires preserving two’s-complement representation; mask with
0xFFFFFFFFbefore packing and reinterpret on decode. -
Header design separates metadata from payload: store mode, bit width, run length, or block size so decoder can unambiguously parse bytes.
-
Decoder symmetry matters: every encoder write path needs a corresponding read path; test round-trips with
decode(encode(x)) == x. -
Complexity target is
O(n)time andO(1)toO(block_size)auxiliary space; avoid string concatenation or per-bit arrays for large inputs.
Common pitfalls
Pitfall: Forgetting to flush the final RLE run produces correct output for mid-stream transitions but loses the last value.
Pitfall: Using arithmetic right shift or signed casts inconsistently can corrupt negative numbers during bit-packing decode.
Pitfall: Designing an ambiguous format, such as writing counts and values without delimiters, widths, or block metadata, makes decoding impossible.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions
Snapshotable Collections And Iterators
Focus areaFocus area — Repeated across rounds and closely tied to concurrency; practice stable views, versioning, tombstones, and cleanup trade-offs.

What's being tested
Tests versioned collection design: preserving stable iterator views while the underlying set mutates. You must demonstrate clear iterator semantics, complexity tradeoffs, and an implementation that handles add/remove/re-add without leaking deleted state into active snapshots.
Patterns & templates
-
Copy-on-iterator snapshot —
`iterator()`copies current elements into an array/list;O(n)creation,O(1)next, simplest correctness story. -
Copy-on-write set — clone backing
`HashSet`before mutation when snapshots exist; good when reads dominate, costly for frequent writes. -
Operation log with versions — store
`addVersion`/`removeVersion`per element; iterator captures`snapshotVersion`, filters by visibility inO(1)orO(log k). -
Reference-counted snapshots — track active iterators; defer cleanup of tombstoned elements until all older snapshots are closed or exhausted.
-
Stable iterator contract — define whether
`hasNext()`is idempotent, whether iteration order matters, and whether mutations during iteration are visible. -
Complexity framing — compare
`add`,`remove`,`contains`,`iterator`,`next`; strong answers explicitly optimize for expected read/write ratio.
Common pitfalls
Pitfall: Returning a raw
`HashSet`iterator gives fail-fast or live-view behavior, not snapshot semantics.
Pitfall: Treating remove as physical deletion immediately breaks older iterators that still need to see the element.
Pitfall: Ignoring re-add after remove; visibility must depend on version intervals, not just current membership.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions
Focus area — Repeated across rounds and easy to lose edge cases under pressure; emphasize recurrence design, indexing, and boundary reasoning.
What's being tested
These problems probe mastery of recursion on an implicit tree: compute and use subtree sizes given a Fibonacci-like recurrence, then map preorder indexes to tree paths with index arithmetic. Interviewers check correctness, off-by-one handling, and efficient traversal without materializing the whole tree.
Patterns & templates
- Compute subtree sizes with the Fibonacci recurrence and memoize (
`getSize(k)`); this is O(k) to build, constant-time lookups thereafter. - Preorder index navigation: compare target index to
`left_size`+ 1 to decide left/root/right; iterate until leaf — O(height) steps. - Iterative recursion: replace recursive descent with a loop that updates node rank and subtree order to avoid stack depth issues.
- Path encoding: emit directions (
`L`/`R`) while descending; reconstruct node-to-node path by computing LCA via index-to-path conversion. - Dynamic programming on indices: cache computed path segments or sizes across queries to answer multiple path requests in O(height) each.
- Guard against overflow: cap Fibonacci sizes at
`INT64_MAX`and clamp indices; check for invalid indexes early.
Common pitfalls
Pitfall: Off-by-one errors when treating
`left_size`, root position, and right subtree start — always test small k (1..4) to validate indexing.
Pitfall: Assuming tree height is log(n); for Fibonacci trees height ≈ index k, so complexity must be in terms of k, not total nodes.
Pitfall: Using recursion without tail-call elimination can blow the stack for deep k — prefer iterative descent or explicit stack.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions