Compute leakage-safe rolling features in pandas
Company: Freddie Mac
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Onsite
Using pandas on a 50M-row monthly panel with columns [loan_id, msa, month, property_type, delinquent_90dpd (0/1), upb], create the following features for each loan-month:
(a) a 12-month rolling delinquency rate per (msa, property_type) that excludes the current month (a strict window ending at t-1);
(b) a target-encoded property_type delinquency rate per MSA that uses only data strictly before the current month (leave-one-time-step-out);
(c) an exponentially weighted default intensity per loan with a half-life of 6 months.
Return a leakage-safe DataFrame with one row per loan-month containing these features.
Explain how you would:
(1) keep memory usage under 16 GB of RAM (categorical dtypes, downcasting, chunked joins, parquet scans);
(2) guarantee time-order correctness after shuffles (sorted indices, stable groupby; avoiding groupby.apply pitfalls);
(3) unit test correctness with a small deterministic example covering edge cases (missing months, single-observation groups).
Provide the big-O time and memory tradeoffs of your approach.
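One possible shape for the three features is sketched below. It is not a reference solution. It assumes that `month` holds month-start timestamps, that there is exactly one row per (loan_id, month), and that the grouping keys have plain (non-categorical) dtypes. The function name `add_leakage_safe_features` and the output column names are hypothetical. Feature (a) is interpreted as a pooled rate (delinquent loan-months divided by all loan-months over the prior 12 calendar months), feature (b) as an unsmoothed expanding rate, and feature (c) as a normalized exponentially weighted mean of the loan's earlier flags. All three are NaN when no prior history exists.

```python
import numpy as np
import pandas as pd

KEYS = ["msa", "property_type"]
Y = "delinquent_90dpd"

def add_leakage_safe_features(df: pd.DataFrame, halflife: float = 6.0) -> pd.DataFrame:
    """Add three features that only use months strictly before each row's month."""
    out = df.reset_index(drop=True)  # unique row labels; input row order is kept

    # Collapse the panel to (msa, property_type, month) cells, sorted by key and month.
    cell = (out.groupby(KEYS + ["month"], observed=True, sort=True)[Y]
               .agg(bad="sum", n="count"))

    # (a) Go wide on a complete monthly calendar so a missing month is an empty
    # month, sum the last 12 rows, then shift by one row to drop the current month.
    wide = cell.unstack(KEYS).fillna(0.0)
    calendar = pd.date_range(wide.index.min(), wide.index.max(), freq="MS")
    roll = wide.reindex(calendar, fill_value=0.0).rolling(12, min_periods=1).sum().shift(1)
    rate = (roll["bad"] / roll["n"]).rename_axis(index="month")  # 0/0 gives NaN
    feat_a = rate.unstack().rename("dq_rate_prior_12m").reset_index()

    # (b) Expanding target encoding: cumulative totals minus the current month.
    long = cell.reset_index()
    g = long.groupby(KEYS, observed=True, sort=False)
    prior_bad = g["bad"].cumsum() - long["bad"]
    prior_n = g["n"].cumsum() - long["n"]
    long["te_prior"] = prior_bad / prior_n.where(prior_n > 0)
    feat_b = long[KEYS + ["month", "te_prior"]]

    out = out.merge(feat_a, on=KEYS + ["month"], how="left")
    out = out.merge(feat_b, on=KEYS + ["month"], how="left")

    # (c) Weight each past flag by 2 ** (month_index / halflife). The ratio of two
    # per-loan cumulative sums, lagged by one row, is the exponentially weighted
    # mean of earlier months and respects gaps between observed months.
    s = out.sort_values(["loan_id", "month"])
    m = (s["month"].dt.year * 12 + s["month"].dt.month).astype("float64")
    w = np.exp2((m - m.min()) / halflife)
    yw = s[Y] * w
    loan = s["loan_id"]
    num = yw.groupby(loan, sort=False).cumsum().groupby(loan, sort=False).shift(1)
    den = w.groupby(loan, sort=False).cumsum().groupby(loan, sort=False).shift(1)
    out["ew_dq_intensity"] = num / den  # assignment aligns on the row labels
    return out

# Small deterministic example, deliberately out of time order. Loan A skips
# 2020-02 (missing month); loans B and C are single-observation cases.
demo = pd.DataFrame({
    "loan_id": ["A", "C", "A", "B", "A"],
    "msa": [1, 2, 1, 1, 1],
    "property_type": ["SF", "CO", "SF", "SF", "SF"],
    "month": pd.to_datetime(["2020-04-01", "2020-03-01", "2020-01-01",
                             "2020-01-01", "2020-03-01"]),
    "delinquent_90dpd": [0, 1, 0, 1, 1],
})
features = add_leakage_safe_features(demo)
print(features)
```

Every step here is a sort, a grouped aggregate, or a grouped cumulative sum, so no `groupby.apply` is involved and the result does not depend on the input row order. Under these assumptions the sort on (loan_id, month) dominates at roughly O(N log N) time, the aggregations are about O(N), and the wide calendar grid needs O(G x T) memory for G groups and T months on top of O(N) for each added column.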
Quick Answer: This question evaluates proficiency in time-series feature engineering, leakage prevention, memory-efficient large-scale data processing, categorical target encoding, exponential weighting, and algorithmic complexity within the Data Manipulation (SQL/Python) domain, emphasizing practical application with pandas at scale.
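The memory-efficiency part of the question can be illustrated with a small dtype-shrinking sketch. The helper name `shrink_panel` and the synthetic data are hypothetical. The sketch assumes the column names from the prompt and that `upb` tolerates float32 precision (about seven significant digits), which may not hold for real balances.

```python
import numpy as np
import pandas as pd

def shrink_panel(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of a loan-month panel with narrower dtypes."""
    out = df.copy()
    for col in ("loan_id", "msa", "property_type"):
        out[col] = out[col].astype("category")  # dictionary-encoded keys
    out["delinquent_90dpd"] = out["delinquent_90dpd"].astype("int8")  # 0/1 flag
    # to_numeric only downcasts when the values survive the narrower type.
    out["upb"] = pd.to_numeric(out["upb"], downcast="float")
    return out

# Synthetic 1,000-row panel, used only to compare memory footprints.
n = 1000
i = np.arange(n)
raw = pd.DataFrame({
    "loan_id": [f"L{k % 50:04d}" for k in i],
    "msa": 10000 + (i % 5),
    "month": pd.Timestamp("2020-01-01") + pd.to_timedelta(i // 50, unit="D"),
    "property_type": np.where(i % 2 == 0, "SF", "CO"),
    "delinquent_90dpd": (i % 7 == 0).astype("int64"),
    "upb": 100000.0 + 250.0 * i,
})
small = shrink_panel(raw)
before = raw.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"bytes before: {before}, after: {after}")
```

On the full panel the same idea would typically be combined with reading only the needed columns, for example through the `columns` argument of `pd.read_parquet`, and with processing one partition at a time so that only the small group-month aggregates stay in memory across chunks.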