Assess SQL cleaning, mapping, joins, keys, and DDL/DML
Company: Freddie Mac
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
You inherit a small retail analytics warehouse. Use the schema and sample rows below to answer all parts.
Schema and sample data (minimal rows shown):
Customers
+-------------+----------------------+---------------------+-------------+
| customer_id | email | created_at | referrer_id |
+-------------+----------------------+---------------------+-------------+
| 101 | a@x.com | 2025-07-11 09:15:00 | null |
| 102 | b@y.com | 2025-07-12 10:20:00 | 101 |
| 103 | a@x.com | 2025-07-12 10:21:00 | null |
+-------------+----------------------+---------------------+-------------+
Orders
+----------+-------------+---------------------+--------+--------------+
| order_id | customer_id | order_date | status | total_amount |
+----------+-------------+---------------------+--------+--------------+
| 5001 | 101 | 2025-08-01 12:00:00 | paid | 49.99 |
| 5002 | 101 | 2025-08-01 12:00:10 | paid | 49.99 |
| 5003 | 103 | 2025-08-02 09:00:00 | cancel | 15.00 |
+----------+-------------+---------------------+--------+--------------+
OrderItems
+----------+---------+------------+-----+------------+
| order_id | line_no | product_id | qty | unit_price |
+----------+---------+------------+-----+------------+
| 5001 | 1 | 9001 | 1 | 49.99 |
| 5002 | 1 | 9001 | 1 | 49.99 |
+----------+---------+------------+-----+------------+
Products
+------------+-------+----------------+--------+
| product_id | sku | name | price |
+------------+-------+----------------+--------+
| 9001 | SKU-1 | Widget Basic | 49.99 |
| 9002 | SKU-1 | Widget Basic | 49.99 |
+------------+-------+----------------+--------+
Answer all parts:
1) Data cleaning: Outline a concrete, ordered plan to clean this warehouse. Include: (a) deduplicating Customers by email while preserving the most trustworthy record (define and justify tie-breakers), (b) deduplicating Orders that appear to be ingestion duplicates (define a natural key and a time/amount tolerance you would use), (c) standardizing data types/time zones, validating referential integrity across all tables, and detecting impossible values (e.g., negative qty). For at least two steps, provide representative SQL snippets (or Python+pandas) you would run.
2) Data mapping: You must onboard a CSV web_orders_2025-08.csv with columns (order_id, order_ts, customer_email, sku, qty, price_cents). List exactly what fields you would include in a formal data mapping specification to land this file into Orders and OrderItems (e.g., source column, target table.column, data type, transformation logic, constraints, null/default rules, semantics/definitions, lineage, owner, quality tests, sample values, refresh cadence). Then, map each CSV column to its target columns and required transforms for this schema.
3) Find duplicates: (a) Write a query that flags duplicate customers by email and returns only the row you would keep per email with a reason code for why others were dropped. (b) Write a query that returns likely duplicate orders for the same customer when order_date is within 30 seconds and total_amount matches to 2 decimal places, grouped into clusters with a stable surrogate duplicate_group_id. (c) Write a query to detect duplicate Products by sku and pick the survivor deterministically.
4) DDL vs DML: Explain the practical differences between DDL and DML in this scenario, including transactional and locking implications. Provide one DDL statement you would actually run here to enforce integrity and one DML statement that demonstrates data manipulation on these tables.
5) Self join: Using Customers, return each customer_id, email, and the referrer’s email (if any). Ensure customers with no referrer still appear exactly once.
6) Conditional insert (IF NOT EXISTS…INSERT/UPSERT): Show how you would insert or update a Product by sku so that duplicate SKUs like SKU-1 collapse into a single correct row without violating constraints. Include how you would handle concurrent writes.
7) Keys and constraints: Define appropriate PRIMARY KEY, FOREIGN KEY, and UNIQUE constraints for all four tables given the sample data. Justify any composite keys (e.g., OrderItems) and discuss trade-offs of surrogate vs natural keys for Customers and Orders.
8) Describe your SQL skill: Rate your SQL proficiency from 1–10 and justify with two concrete examples from the tasks above (e.g., a window-function-based dedupe and a concurrency-safe upsert), including how you would test and optimize them.
Quick Answer: This question evaluates skills in data cleaning, deduplication, schema mapping, referential integrity, DDL/DML, and SQL/Python data manipulation, focusing on joins, keys, validation of values, and ETL mapping competencies.