Data Engineer Data Manipulation (SQL/Python) Interview Questions
Master your tech interview with our curated database of real questions from top companies.
Optimize SQL to minimize scans
Given a large analytics query, refactor it to minimize table scans. 1) Replace unnecessary CTEs that cause multiple scans with inline aggregations or ...
Debug a Hive Query for DAU
You are given two Hive tables: users(user_id BIGINT, created_at TIMESTAMP) and events(user_id BIGINT, event_time TIMESTAMP, event_name STRING) PARTITI...
Find top 3 books by total borrowed time
Using copies(copy_id, book_id) and checkouts(copy_id, checkout_date, return_date), compute for each book_id the total borrowed duration as the sum ove...
Write Postgres string parsing and aggregation query
You are given a PostgreSQL table events(user_id TEXT, raw TEXT) where raw stores pipe-delimited key=value pairs, for example 'user=U1|country=US|ts=20...
Find customer with max rentals in consecutive weeks
You are given a table purchases(customer_id INT, purchase_date DATE, rented_copies INT). Consider only dates in calendar year 2024. Define a full week...
Solve Python and SQL data tasks
Complete both tasks: 1) Python: Implement a function flatten(nested) that takes a list whose elements are integers or arbitrarily nested lists of inte...
Write PostgreSQL string-manipulation query
You are given a PostgreSQL table clickstream(session_id TEXT, page TEXT, query TEXT, ts TIMESTAMP). The page column contains full URLs like 'https://s...
Compute reservation diff for largest member
Given copies(copy_id, reserved_by_member_id) and members(member_id, referred_by_member_id), find the member with the largest member_id. Return a singl...
Return count and renewal percentage of unreturned good copies
Tables: copies(copy_id, condition), checkouts(copy_id, checkout_date, return_date, renewal_count). Write a single SQL query that returns one row with ...
Tackle Python tasks under time pressure
In a 15-minute coding round, implement a small Python function or class to solve a well-scoped problem within about 5 minutes of coding. 1) State 1–2 ...
Solve library SQL and Python tasks
You are given a library domain. Assume these tables: - books(book_id, author_id, title) - authors(author_id, name) - copies(copy_id, book_id, conditio...
Solve three SQL problems (easy/medium/hard)
Answer the following SQL tasks of increasing difficulty: 1) Aggregation: Given Orders(order_id, customer_id, amount, order_date), return each customer...
Write SQL for library analytics
Given a library database, write SQL to answer: 1) How many books are in good condition and not returned? Among these books, what is the percentage tha...
Aggregate Netflix metrics in SQL
Question Netflix video-streaming analytics SQL: Write a simple aggregation (e.g., total watch-time per day). Build a cumulative metric: today’s metric...
Query carpool ride metrics
Question For a ride-sharing product with carpool capability, answer a series of SQL questions (e.g., daily completed pooled rides, average seats utili...
Count active follow connections
Question Write SQL to return the current number of active follow connections. Events table columns: user_id, target_id, event_type ('request_follow', ...
Compute cumulative metrics with full joins
Tables: - daily_metrics(date DATE, content_id STRING, daily_value BIGINT) - cumulative_metrics(date DATE, content_id STRING, cumulative_value BIGINT) ...
Write SQL for Transactions and Customers
You are given two tables: - customers(customer_id INT PRIMARY KEY, signup_date DATE, region VARCHAR, segment VARCHAR) - transactions(transaction_id IN...
Recommend two-hop follows in Python
Given a directed "follows" graph as a Python dict[str, list[str]], implement recommend_two_hop(graph, user) that returns the set (or a sorted list) of...
Write SQL for active follow connections
Table: follow_events(requester_id INT, target_id INT, event STRING CHECK (event IN ('request_follow','follow_success','follow_reject','unfollow')), ev...