This question evaluates proficiency in data cleaning (type conversion), missing-data handling, group-period aggregation, and estimating treatment effects via difference-in-differences.
You are given a pandas DataFrame df with the following columns:
- unit_id (string): entity identifier (e.g., user, city, driver)
- group (string): either 'treatment' or 'control'
- period (string): either 'pre' or 'post'
- y (string): outcome stored as a string (should be numeric), with exactly one missing value (NaN)
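To make the schema concrete, here is a small toy DataFrame that follows it. The unit IDs and outcome values are invented purely for illustration and are not part of the question; only the column names and the allowed category labels come from the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four units, each observed in 'pre' and 'post'.
# All values are made up; y is stored as strings with exactly one NaN.
df = pd.DataFrame(
    {
        "unit_id": ["u1", "u1", "u2", "u2", "u3", "u3", "u4", "u4"],
        "group": ["treatment"] * 4 + ["control"] * 4,
        "period": ["pre", "post"] * 4,
        "y": ["10", "16", "12", np.nan, "8", "9", "10", "12"],
    }
)

# Sanity check: the outcome column has exactly one missing value.
n_missing = int(df["y"].isna().sum())
print(n_missing)  # 1
```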
Tasks:
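1. Convert y from string to integer (assume all non-missing values are valid integer strings, e.g. '12').
2. Impute the missing y using the simple (unconditional) average of the non-missing y values.
3. Compute the difference-in-differences (DiD) estimate on y from the four group-period means: (treatment post mean - treatment pre mean) - (control post mean - control pre mean).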
Return the scalar DiD estimate (and optionally the intermediate group-period means used).
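One possible way to carry out the tasks is sketched below. It is a minimal sketch under stated assumptions, not a reference solution: the function name did_estimate and the toy data are hypothetical, and y is kept as a float column after imputation because the overall mean may be fractional (rounding it back to an integer would change the estimate, so that choice is left to the reader).

```python
import numpy as np
import pandas as pd


def did_estimate(df: pd.DataFrame):
    """Return (DiD estimate, group-period means). Hypothetical helper name."""
    out = df.copy()
    # Parse the integer strings; the single NaN forces a float dtype here.
    y = pd.to_numeric(out["y"]).astype("float64")
    # Unconditional mean of the non-missing values (mean() skips NaN by default).
    overall_mean = y.mean()
    # Assumption: keep y as float after imputation, since the mean may be fractional.
    out["y"] = y.fillna(overall_mean)
    # Four cell means indexed by (group, period).
    means = out.groupby(["group", "period"])["y"].mean()
    treat_change = means.loc[("treatment", "post")] - means.loc[("treatment", "pre")]
    control_change = means.loc[("control", "post")] - means.loc[("control", "pre")]
    return float(treat_change - control_change), means


# Hypothetical toy data; values are invented for illustration only.
df = pd.DataFrame(
    {
        "unit_id": ["u1", "u1", "u2", "u2", "u3", "u3", "u4", "u4"],
        "group": ["treatment"] * 4 + ["control"] * 4,
        "period": ["pre", "post"] * 4,
        "y": ["10", "16", "12", np.nan, "8", "9", "10", "12"],
    }
)

did, means = did_estimate(df)
print(did)  # 1.0
```

With this toy data the seven non-missing values average to 11, so the treatment means are 11 (pre) and 13.5 (post), the control means are 9 (pre) and 10.5 (post), and the estimate is 2.5 - 1.5 = 1.0. If a strictly integer column is required between steps, pandas' nullable "Int64" dtype can hold the parsed values before imputation.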