Design SQL cleaning, mapping, dedupe, and keying

Q: How do I practice SQL interview questions?

PracHub provides an interactive SQL console where you can write and test queries against real database schemas. Get instant feedback and compare your solution with the expected output.

Q: What difficulty level is this coding question?

This is a medium difficulty Data Manipulation (SQL/Python) question, commonly asked during Technical Screen rounds at Freddie Mac.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Freddie Mac during technical interviews.

Question

Assume PostgreSQL 15. You may choose a different SQL dialect if you clearly state it and adapt syntax accordingly.

- Table: orders(order_id INT PK, customer_id INT FK -> customers(customer_id), order_date DATE, amount NUMERIC(10,2))
 Rows:
 1001 | 101 | 2025-08-20 | 120.00
 1002 | 101 | 2025-08-21 | 50.00
 1003 | 102 | 2025-08-22 | 200.00

Answer the following. Be precise, handle edge cases, and justify design choices briefly.

a) Data cleaning: Write a single SQL statement (using CTEs allowed) that produces a cleaned result set from staging_customers with columns: email_norm, full_name_norm, created_at_ts, country_code. Rules: trim all whitespace, collapse internal double spaces in names, lowercase emails, remove trailing/leading punctuation, validate email by simple rule (must contain one '@' and at least one '.' after '@'; reject otherwise), standardize country to ISO-3166-1 alpha-2 with mappings {US, usa, United States -> US}, leave others as their two-letter code if already valid else NULL; parse created_at flexibly to TIMESTAMP, set to NULL if unparsable. Deduplicate by email_norm keeping the row with the most recent created_at_ts; if tie, keep the smallest stg_customer_id.

b) Data mapping spec: Provide a concise source-to-target mapping to load the cleaned result into customers, including for each target column: data type, source expression, transformation rule, and nullability/defaults (e.g., how updated_at is populated). Include at least one example row from the sample data showing the before/after values for Ana Li.

c) Find duplicates: 1) In customers, return any duplicate person records by case-insensitive, trimmed email (treat lower(trim(email)) as the uniqueness key). Show email_key and count, plus a sample customer_id list. 2) In staging_customers (after cleaning logic from part a), find records that would collide with existing customers on email_key and show both staging and customers identifiers side-by-side.

d) DDL vs DML: For each of the following, specify whether it’s DDL or DML, then write the exact SQL:
 - Enforce email uniqueness in customers on lower(trim(email)).
 - Add a foreign key from orders.customer_id to customers.customer_id with a deletion rule you choose (CASCADE or RESTRICT) and explain why.
 - Create any index(es) you deem necessary to support parts c) and f), and name them appropriately.

e) Self-join: Using orders, list customer_id pairs of consecutive-day orders by the same customer (o2.order_date = o1.order_date + INTERVAL '1 day'). Output: customer_id, first_order_id, first_date, next_order_id, next_date. Avoid duplicate pairs and ensure performance on large tables.

f) IF EXISTS…INSERT (upsert): Insert or update customers from the cleaned staging set. If email_key does not exist in customers, INSERT a new row (created_at from staging, country_code from cleaning). If it exists, UPDATE full_name if different (prefer the longest non-null normalized name), update country_code if NULL in customers and non-NULL in staging, and set updated_at = NOW(). Implement as a single statement (e.g., INSERT…ON CONFLICT for Postgres or MERGE for SQL Server). Explain how your approach avoids race conditions under concurrent loads (mention constraints and locking behavior).

PracHub · Accepted Answer

This question evaluates proficiency in data cleaning and normalization, deduplication, timestamp parsing, simple email validation, country code standardization, source-to-target mapping, and DDL/DML design using SQL (and optionally Python).

Quick Overview

Quick Overview