PracHub
QuestionsPremiumLearningGuidesInterview PrepNEWCoaches

Quick Overview

This question evaluates skills in data cleaning, deduplication, schema mapping, referential integrity, DDL/DML, and SQL/Python data manipulation, focusing on joins, keys, validation of values, and ETL mapping competencies.

  • Medium
  • Freddie Mac
  • Data Manipulation (SQL/Python)
  • Data Scientist

Assess SQL cleaning, mapping, joins, keys, and DDL/DML

Company: Freddie Mac

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: Medium

Interview Round: Technical Screen

You inherit a small retail analytics warehouse. Use the schema and sample rows below to answer all parts. Schema and sample data (minimal rows shown): Customers +-------------+----------------------+---------------------+-------------+ | customer_id | email | created_at | referrer_id | +-------------+----------------------+---------------------+-------------+ | 101 | a@x.com | 2025-07-11 09:15:00 | null | | 102 | b@y.com | 2025-07-12 10:20:00 | 101 | | 103 | a@x.com | 2025-07-12 10:21:00 | null | +-------------+----------------------+---------------------+-------------+ Orders +----------+-------------+---------------------+--------+--------------+ | order_id | customer_id | order_date | status | total_amount | +----------+-------------+---------------------+--------+--------------+ | 5001 | 101 | 2025-08-01 12:00:00 | paid | 49.99 | | 5002 | 101 | 2025-08-01 12:00:10 | paid | 49.99 | | 5003 | 103 | 2025-08-02 09:00:00 | cancel | 15.00 | +----------+-------------+---------------------+--------+--------------+ OrderItems +----------+---------+------------+-----+------------+ | order_id | line_no | product_id | qty | unit_price | +----------+---------+------------+-----+------------+ | 5001 | 1 | 9001 | 1 | 49.99 | | 5002 | 1 | 9001 | 1 | 49.99 | +----------+---------+------------+-----+------------+ Products +------------+-------+----------------+--------+ | product_id | sku | name | price | +------------+-------+----------------+--------+ | 9001 | SKU-1 | Widget Basic | 49.99 | | 9002 | SKU-1 | Widget Basic | 49.99 | +------------+-------+----------------+--------+ Answer all parts: 1) Data cleaning: Outline a concrete, ordered plan to clean this warehouse. Include: (a) deduplicating Customers by email while preserving the most trustworthy record (define and justify tie-breakers), (b) deduplicating Orders that appear to be ingestion duplicates (define a natural key and a time/amount tolerance you would use), (c) standardizing data types/time zones, validating referential integrity across all tables, and detecting impossible values (e.g., negative qty). For at least two steps, provide representative SQL snippets (or Python+pandas) you would run. 2) Data mapping: You must onboard a CSV web_orders_2025-08.csv with columns (order_id, order_ts, customer_email, sku, qty, price_cents). List exactly what fields you would include in a formal data mapping specification to land this file into Orders and OrderItems (e.g., source column, target table.column, data type, transformation logic, constraints, null/default rules, semantics/definitions, lineage, owner, quality tests, sample values, refresh cadence). Then, map each CSV column to its target columns and required transforms for this schema. 3) Find duplicates: (a) Write a query that flags duplicate customers by email and returns only the row you would keep per email with a reason code for why others were dropped. (b) Write a query that returns likely duplicate orders for the same customer when order_date is within 30 seconds and total_amount matches to 2 decimal places, grouped into clusters with a stable surrogate duplicate_group_id. (c) Write a query to detect duplicate Products by sku and pick the survivor deterministically. 4) DDL vs DML: Explain the practical differences between DDL and DML in this scenario, including transactional and locking implications. Provide one DDL statement you would actually run here to enforce integrity and one DML statement that demonstrates data manipulation on these tables. 5) Self join: Using Customers, return each customer_id, email, and the referrer’s email (if any). Ensure customers with no referrer still appear exactly once. 6) Conditional insert (IF NOT EXISTS…INSERT/UPSERT): Show how you would insert or update a Product by sku so that duplicate SKUs like SKU-1 collapse into a single correct row without violating constraints. Include how you would handle concurrent writes. 7) Keys and constraints: Define appropriate PRIMARY KEY, FOREIGN KEY, and UNIQUE constraints for all four tables given the sample data. Justify any composite keys (e.g., OrderItems) and discuss trade-offs of surrogate vs natural keys for Customers and Orders. 8) Describe your SQL skill: Rate your SQL proficiency from 1–10 and justify with two concrete examples from the tasks above (e.g., a window-function-based dedupe and a concurrency-safe upsert), including how you would test and optimize them.

Quick Answer: This question evaluates skills in data cleaning, deduplication, schema mapping, referential integrity, DDL/DML, and SQL/Python data manipulation, focusing on joins, keys, validation of values, and ETL mapping competencies.

Last updated: Mar 29, 2026

Related Coding Questions

  • Compute leakage-safe rolling features in pandas - Freddie Mac (Medium)
  • Design SQL cleaning, mapping, dedupe, and keying - Freddie Mac (Medium)

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.