Data Engineer Interview Questions
Master your tech interview with our curated database of real questions from top companies.
Implement RotatingFileSink in hierarchy
You are given an existing Python 3.10 codebase with an abstract base class RecordSink (from abc import ABC, abstractmethod) that defines open() -> Non...
Optimize SQL to minimize scans
Given a large analytics query, refactor it to minimize table scans. 1) Replace unnecessary CTEs that cause multiple scans with inline aggregations or ...
Solve SQL and Python coding tasks
You are given a small library system with the following relational schema and several Python data-processing tasks. Answer the SQL questions and imple...
Compute capacities after site closures
You are given a nested dictionary redistribution where redistribution[closed_site][dest_site] equals the additional capacity required at dest_site if ...
Solve library SQL and Python tasks
You are given a library domain. Assume these tables: - books(book_id, author_id, title) - authors(author_id, name) - copies(copy_id, book_id, conditio...
Validate alternating checkout/return logs
Given a chronological list of events logs of the form (timestamp, book_id, is_checkout) where is_checkout is True for a checkout and False for a retur...
Find customer with max rentals in consecutive weeks
You are given a table purchases(customer_id INT, purchase_date DATE, rented_copies INT). Consider only dates in calendar year 2024. Define a full week...
Find top 3 books by total borrowed time
Using copies(copy_id, book_id) and checkouts(copy_id, checkout_date, return_date), compute for each book_id the total borrowed duration as the sum ove...
Design a scalable dimensional model
Design a Dimensional Model for Transactional Analytics (Concrete Example Included) You are building a star-schema in a cloud data warehouse for near r...
Tackle Python tasks under time pressure
In a 15-minute coding round, implement a small Python function or class to solve a well-scoped problem within about 5 minutes of coding. 1) State 1–2 ...
Write Postgres string parsing and aggregation query
You are given a PostgreSQL table events(user_id TEXT, raw TEXT) where raw stores pipe-delimited key=value pairs, for example 'user=U1|country=US|ts=20...
Design tables for event-driven metrics
Design a Relational Schema for Consumer-App Event Analytics Context and Assumptions You are designing the event store for a high-volume consumer app. ...
Write PostgreSQL string-manipulation query
You are given a PostgreSQL table clickstream(session_id TEXT, page TEXT, query TEXT, ts TIMESTAMP). The page column contains full URLs like 'https://s...
Implement classes within an abstract Python framework
You are given an existing Python codebase (~200 lines shown) that defines an abstract base class DataProcessor with abstract methods load(self), trans...
Define and analyze product metrics
Product Analytics Case: Short‑Form Video Feed Context: You are evaluating a short‑form video feed feature inside a large social app where users swipe ...
Compute reservation diff for largest member
Given copies(copy_id, reserved_by_member_id) and members(member_id, referred_by_member_id), find the member with the largest member_id. Return a singl...
Compute missing letters to form original string
Implement a function that, given two strings original and typed (typed is a misspelled/partial version of original), returns the number of additional ...
Return count and renewal percentage of unreturned good copies
Tables: copies(copy_id, condition), checkouts(copy_id, checkout_date, return_date, renewal_count). Write a single SQL query that returns one row with ...
Solve Python and SQL data tasks
Complete both tasks: 1) Python: Implement a function flatten(nested) that takes a list whose elements are integers or arbitrarily nested lists of inte...
Define success metrics for a social feed
Define Success Metrics for a Social Feed Feature You are evaluating a change to the main social feed in a large-scale consumer app. Assume events are ...