Design SQL cleaning, mapping, dedupe, and keying
Company: Freddie Mac
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
Assume PostgreSQL 15. You may choose a different SQL dialect if you clearly state it and adapt syntax accordingly.
Schema and small sample data:
- Table: staging_customers(stg_customer_id INT, full_name TEXT, email TEXT, created_at TEXT, country TEXT)
Rows:
1 | Ana Li | ana.li@example.com | 2025-08-28 10:12:00 | US
2 | Ana Li | ANA.LI@example.com | 2025-08-28 10:12 | usa
3 | Ben O'Neal | ben.oneal@example | 2025/08/30 | UK
4 | Cara-D | cara.d@example.com | 2025-08-31 | United States
5 | null | ana.li@example.com | 2025-08-29 | null
- Table: customers(customer_id INT PK, full_name TEXT, email TEXT, created_at TIMESTAMP, country_code CHAR(2), updated_at TIMESTAMP NULL)
Rows:
101 | Ana Li | ana.li@example.com | 2025-08-01 09:00:00 | US | null
102 | Dae Kim| dae.kim@example.com | 2025-08-15 12:00:00 | KR | null
- Table: orders(order_id INT PK, customer_id INT FK -> customers(customer_id), order_date DATE, amount NUMERIC(10,2))
Rows:
1001 | 101 | 2025-08-20 | 120.00
1002 | 101 | 2025-08-21 | 50.00
1003 | 102 | 2025-08-22 | 200.00
Answer the following. Be precise, handle edge cases, and justify design choices briefly.
a) Data cleaning: Write a single SQL statement (using CTEs allowed) that produces a cleaned result set from staging_customers with columns: email_norm, full_name_norm, created_at_ts, country_code. Rules: trim all whitespace, collapse internal double spaces in names, lowercase emails, remove trailing/leading punctuation, validate email by simple rule (must contain one '@' and at least one '.' after '@'; reject otherwise), standardize country to ISO-3166-1 alpha-2 with mappings {US, usa, United States -> US}, leave others as their two-letter code if already valid else NULL; parse created_at flexibly to TIMESTAMP, set to NULL if unparsable. Deduplicate by email_norm keeping the row with the most recent created_at_ts; if tie, keep the smallest stg_customer_id.
b) Data mapping spec: Provide a concise source-to-target mapping to load the cleaned result into customers, including for each target column: data type, source expression, transformation rule, and nullability/defaults (e.g., how updated_at is populated). Include at least one example row from the sample data showing the before/after values for Ana Li.
c) Find duplicates: 1) In customers, return any duplicate person records by case-insensitive, trimmed email (treat lower(trim(email)) as the uniqueness key). Show email_key and count, plus a sample customer_id list. 2) In staging_customers (after cleaning logic from part a), find records that would collide with existing customers on email_key and show both staging and customers identifiers side-by-side.
d) DDL vs DML: For each of the following, specify whether it’s DDL or DML, then write the exact SQL:
- Enforce email uniqueness in customers on lower(trim(email)).
- Add a foreign key from orders.customer_id to customers.customer_id with a deletion rule you choose (CASCADE or RESTRICT) and explain why.
- Create any index(es) you deem necessary to support parts c) and f), and name them appropriately.
e) Self-join: Using orders, list customer_id pairs of consecutive-day orders by the same customer (o2.order_date = o1.order_date + INTERVAL '1 day'). Output: customer_id, first_order_id, first_date, next_order_id, next_date. Avoid duplicate pairs and ensure performance on large tables.
f) IF EXISTS…INSERT (upsert): Insert or update customers from the cleaned staging set. If email_key does not exist in customers, INSERT a new row (created_at from staging, country_code from cleaning). If it exists, UPDATE full_name if different (prefer the longest non-null normalized name), update country_code if NULL in customers and non-NULL in staging, and set updated_at = NOW(). Implement as a single statement (e.g., INSERT…ON CONFLICT for Postgres or MERGE for SQL Server). Explain how your approach avoids race conditions under concurrent loads (mention constraints and locking behavior).
Quick Answer: This question evaluates proficiency in data cleaning and normalization, deduplication, timestamp parsing, simple email validation, country code standardization, source-to-target mapping, and DDL/DML design using SQL (and optionally Python).