Data Engineer Data Manipulation (SQL/Python) Interview Questions
Practice the exact questions companies are asking right now.
Optimize SQL to minimize scans
Given a large analytics query, refactor it to minimize table scans. 1) Replace unnecessary CTEs that cause multiple scans with inline aggregations or ...
Solve Python and SQL data tasks
Complete both tasks: 1) Python: Implement a function flatten(nested) that takes a list whose elements are integers or arbitrarily nested lists of inte...
Compute reservation diff for largest member
Given copies(copy_id, reserved_by_member_id) and members(member_id, referred_by_member_id), find the member with the largest member_id. Return a singl...
Solve library SQL and Python tasks
You are given a library domain. Assume these tables: - books(book_id, author_id, title) - authors(author_id, name) - copies(copy_id, book_id, conditio...
Write Postgres string parsing and aggregation query
You are given a PostgreSQL table events(user_id TEXT, raw TEXT) where raw stores pipe-delimited key=value pairs, for example 'user=U1|country=US|ts=20...
Write PostgreSQL string-manipulation query
You are given a PostgreSQL table clickstream(session_id TEXT, page TEXT, query TEXT, ts TIMESTAMP). The page column contains full URLs like 'https://s...
Find top 3 books by total borrowed time
Using copies(copy_id, book_id) and checkouts(copy_id, checkout_date, return_date), compute for each book_id the total borrowed duration as the sum ove...
Return count and renewal percentage of unreturned good copies
Tables: copies(copy_id, condition), checkouts(copy_id, checkout_date, return_date, renewal_count). Write a single SQL query that returns one row with ...
Debug a Hive Query for DAU
You are given two Hive tables: users(user_id BIGINT, created_at TIMESTAMP) and events(user_id BIGINT, event_time TIMESTAMP, event_name STRING) PARTITI...
Tackle Python tasks under time pressure
In a 15-minute coding round, implement a small Python function or class to solve a well-scoped problem within about 5 minutes of coding. 1) State 1–2 ...
Find customer with max rentals in consecutive weeks
You are given a table purchases(customer_id INT, purchase_date DATE, rented_copies INT). Consider only dates in calendar year 2024. Define a full week...
Solve three SQL problems (easy/medium/hard)
Answer the following SQL tasks of increasing difficulty: 1) Aggregation: Given Orders(order_id, customer_id, amount, order_date), return each customer...
Write SQL for library analytics
Given a library database, write SQL to answer: 1) How many books are in good condition and not returned? Among these books, what is the percentage tha...
Aggregate Netflix metrics in SQL
Question Netflix video-streaming analytics SQL: Write a simple aggregation (e.g., total watch-time per day). Build a cumulative metric: today’s metric...
Count active follow connections
Question Write SQL to return the current number of active follow connections. Events table columns: user_id, target_id, event_type ('request_follow', ...
Query carpool ride metrics
Question For a ride-sharing product with carpool capability, answer a series of SQL questions (e.g., daily completed pooled rides, average seats utili...
Write SQL for Transactions and Customers
You are given two tables: - customers(customer_id INT PRIMARY KEY, signup_date DATE, region VARCHAR, segment VARCHAR) - transactions(transaction_id IN...
Compute cumulative metrics with full joins
Tables: - daily_metrics(date DATE, content_id STRING, daily_value BIGINT) - cumulative_metrics(date DATE, content_id STRING, cumulative_value BIGINT) ...
Return top-3 content per category
Given a collection of items with fields (content_id, category, rating), implement top_k_by_category(items, k= 3) that returns, for each category, the ...
Recommend two-hop follows in Python
Given a directed "follows" graph as a Python dict[str, list[str]], implement recommend_two_hop(graph, user) that returns the set (or a sorted list) of...