Compute leakage-safe rolling features in pandas

Q: How do I practice SQL interview questions?

PracHub provides an interactive SQL console where you can write and test queries against real database schemas. Get instant feedback and compare your solution with the expected output.

Q: What difficulty level is this coding question?

This is a medium difficulty Data Manipulation (SQL/Python) question, commonly asked during Onsite rounds at Freddie Mac.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Freddie Mac during technical interviews.

Question

Using pandas on a 50M-row monthly panel with columns [loan_id, msa, month, property_type, delinquent_90dpd (0/1), upb], create features for each loan-month: (a) a 12-month rolling delinquency rate per (msa, property_type) that excludes the current month (strict t-1 window), (b) a target-encoded property_type delinquency rate per MSA using only data strictly before the current month (leave-one-time-step-out), and (c) an exponentially weighted default intensity per loan with half-life = 6 months. Return a DataFrame with one row per loan-month containing these features, leakage-safe. Explain how you would: (1) ensure memory efficiency under 16 GB RAM (categorical dtypes, downcasting, chunked joins, parquet scans), (2) guarantee time-order correctness after shuffles (sort indices, stable groupby; avoid groupby.apply pitfalls), and (3) unit test correctness with a small deterministic example covering edge cases (missing months, single-observation groups). Provide the big-O time/memory tradeoffs of your approach.

PracHub · Accepted Answer

This question evaluates proficiency in time-series feature engineering, leakage prevention, memory-efficient large-scale data processing, categorical target encoding, exponential weighting, and algorithmic complexity within the Data Manipulation (SQL/Python) domain, emphasizing practical application with pandas at scale.

Quick Overview

Quick Overview