PracHub
QuestionsPremiumLearningGuidesInterview PrepNEWCoaches

Quick Overview

This question evaluates proficiency in data cleaning and normalization, deduplication, timestamp parsing, simple email validation, country code standardization, source-to-target mapping, and DDL/DML design using SQL (and optionally Python).

  • Medium
  • Freddie Mac
  • Data Manipulation (SQL/Python)
  • Data Scientist

Design SQL cleaning, mapping, dedupe, and keying

Company: Freddie Mac

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: Medium

Interview Round: Technical Screen

Assume PostgreSQL 15. You may choose a different SQL dialect if you clearly state it and adapt syntax accordingly. Schema and small sample data: - Table: staging_customers(stg_customer_id INT, full_name TEXT, email TEXT, created_at TEXT, country TEXT) Rows: 1 | Ana Li | ana.li@example.com | 2025-08-28 10:12:00 | US 2 | Ana Li | ANA.LI@example.com | 2025-08-28 10:12 | usa 3 | Ben O'Neal | ben.oneal@example | 2025/08/30 | UK 4 | Cara-D | cara.d@example.com | 2025-08-31 | United States 5 | null | ana.li@example.com | 2025-08-29 | null - Table: customers(customer_id INT PK, full_name TEXT, email TEXT, created_at TIMESTAMP, country_code CHAR(2), updated_at TIMESTAMP NULL) Rows: 101 | Ana Li | ana.li@example.com | 2025-08-01 09:00:00 | US | null 102 | Dae Kim| dae.kim@example.com | 2025-08-15 12:00:00 | KR | null - Table: orders(order_id INT PK, customer_id INT FK -> customers(customer_id), order_date DATE, amount NUMERIC(10,2)) Rows: 1001 | 101 | 2025-08-20 | 120.00 1002 | 101 | 2025-08-21 | 50.00 1003 | 102 | 2025-08-22 | 200.00 Answer the following. Be precise, handle edge cases, and justify design choices briefly. a) Data cleaning: Write a single SQL statement (using CTEs allowed) that produces a cleaned result set from staging_customers with columns: email_norm, full_name_norm, created_at_ts, country_code. Rules: trim all whitespace, collapse internal double spaces in names, lowercase emails, remove trailing/leading punctuation, validate email by simple rule (must contain one '@' and at least one '.' after '@'; reject otherwise), standardize country to ISO-3166-1 alpha-2 with mappings {US, usa, United States -> US}, leave others as their two-letter code if already valid else NULL; parse created_at flexibly to TIMESTAMP, set to NULL if unparsable. Deduplicate by email_norm keeping the row with the most recent created_at_ts; if tie, keep the smallest stg_customer_id. b) Data mapping spec: Provide a concise source-to-target mapping to load the cleaned result into customers, including for each target column: data type, source expression, transformation rule, and nullability/defaults (e.g., how updated_at is populated). Include at least one example row from the sample data showing the before/after values for Ana Li. c) Find duplicates: 1) In customers, return any duplicate person records by case-insensitive, trimmed email (treat lower(trim(email)) as the uniqueness key). Show email_key and count, plus a sample customer_id list. 2) In staging_customers (after cleaning logic from part a), find records that would collide with existing customers on email_key and show both staging and customers identifiers side-by-side. d) DDL vs DML: For each of the following, specify whether it’s DDL or DML, then write the exact SQL: - Enforce email uniqueness in customers on lower(trim(email)). - Add a foreign key from orders.customer_id to customers.customer_id with a deletion rule you choose (CASCADE or RESTRICT) and explain why. - Create any index(es) you deem necessary to support parts c) and f), and name them appropriately. e) Self-join: Using orders, list customer_id pairs of consecutive-day orders by the same customer (o2.order_date = o1.order_date + INTERVAL '1 day'). Output: customer_id, first_order_id, first_date, next_order_id, next_date. Avoid duplicate pairs and ensure performance on large tables. f) IF EXISTS…INSERT (upsert): Insert or update customers from the cleaned staging set. If email_key does not exist in customers, INSERT a new row (created_at from staging, country_code from cleaning). If it exists, UPDATE full_name if different (prefer the longest non-null normalized name), update country_code if NULL in customers and non-NULL in staging, and set updated_at = NOW(). Implement as a single statement (e.g., INSERT…ON CONFLICT for Postgres or MERGE for SQL Server). Explain how your approach avoids race conditions under concurrent loads (mention constraints and locking behavior).

Quick Answer: This question evaluates proficiency in data cleaning and normalization, deduplication, timestamp parsing, simple email validation, country code standardization, source-to-target mapping, and DDL/DML design using SQL (and optionally Python).

Last updated: Mar 29, 2026

Related Coding Questions

  • Compute leakage-safe rolling features in pandas - Freddie Mac (Medium)
  • Assess SQL cleaning, mapping, joins, keys, and DDL/DML - Freddie Mac (Medium)

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.