Transform DataFrame and compute diff-in-diff
Company: Uber
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: easy
Interview Round: Technical Screen
You are given a pandas DataFrame `df` with the following columns:
- `unit_id` (string): entity identifier (e.g., user, city, driver)
- `group` (string): either `'treatment'` or `'control'`
- `period` (string): either `'pre'` or `'post'`
- `y` (string): numeric outcome stored as a string, with **exactly one missing value** (NaN)
Tasks:
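1. Convert `y` from string to integer (assume all non-missing values are valid integer strings, e.g. `'12'`). Note that a plain NumPy integer dtype cannot hold NaN, so until the missing value is imputed the column must be float or pandas' nullable `Int64`.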
2. Impute the missing value in `y` using the **simple (unconditional) average** of the non-missing `y` values.
3. After steps (1)–(2), compute the **difference-in-differences (DiD)** estimate of the treatment effect on `y`:
\[
\text{DiD} = (\overline{y}_{\text{treat, post}} - \overline{y}_{\text{treat, pre}}) - (\overline{y}_{\text{ctrl, post}} - \overline{y}_{\text{ctrl, pre}})
\]
Return the scalar DiD estimate (and optionally the intermediate group-period means used).
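
The three tasks can be sketched as below. This is one possible approach, not a reference solution: the function name `did_estimate` and the eight-row `sample` frame are hypothetical, made up for illustration. The sketch keeps `y` as float after imputation, because the mean of integers is not guaranteed to be an integer; casting back to an integer type would be an extra step that is only safe when the imputed value happens to be whole.

```python
import numpy as np
import pandas as pd


def did_estimate(df: pd.DataFrame) -> tuple[float, pd.DataFrame]:
    """Clean `y`, mean-impute its missing value, return (DiD, group-period means)."""
    out = df.copy()  # leave the caller's frame untouched

    # Step 1: parse the strings as numbers. The missing value forces a
    # float dtype here, since a plain integer dtype cannot store NaN.
    out["y"] = pd.to_numeric(out["y"], errors="raise").astype("float64")

    # Step 2: impute with the unconditional mean of the observed values
    # (Series.mean skips NaN by default).
    out["y"] = out["y"].fillna(out["y"].mean())

    # Step 3: 2x2 table of means, rows = group, columns = period.
    means = out.groupby(["group", "period"])["y"].mean().unstack("period")

    # Post-minus-pre change within each group, then treatment minus control.
    change = means["post"] - means["pre"]
    did = change["treatment"] - change["control"]
    return float(did), means


# Hypothetical sample data: the observed values average to exactly 12.
sample = pd.DataFrame(
    {
        "unit_id": ["a", "b", "a", "b", "c", "d", "c", "d"],
        "group": ["treatment"] * 4 + ["control"] * 4,
        "period": ["pre", "pre", "post", "post"] * 2,
        "y": ["10", "12", "20", np.nan, "8", "10", "11", "13"],
    }
)

did, means = did_estimate(sample)
print(means)
print(did)  # (16 - 11) - (12 - 9) = 2.0
```

In the sample, the missing treatment/post value is filled with 12, giving a treatment/post mean of 16. Because the imputation ignores `group` and `period`, it pulls the affected cell toward the overall mean; that is what the task asks for, but it is worth mentioning as a limitation in an interview.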
Quick Answer: This question evaluates proficiency in data cleaning (type conversion), missing-data handling, group-period aggregation, and estimating treatment effects via difference-in-differences.