This multi-part question evaluates a data scientist's competencies in statistical inference (sample size and z-test calculations), causal inference and parallel-trends validation (difference-in-differences), Bayesian probability updating, and interpretable supervised learning feature importance, all framed as coding tasks.
You are completing a CodeSignal-style assessment (Python or R). Implement solutions for the following four independent questions.
You are given:

- `x`: numeric array of historical observations for the metric (use it to estimate the metric standard deviation `sigma`)
- `alpha`: significance level (e.g., 0.05)
- `power`: desired power (e.g., 0.8)
- `effect_size`: the minimum detectable absolute difference in means

Assumptions:

Task: compute the sample size `n` required to detect `effect_size` at level `alpha` with `power`.
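A minimal sketch of one approach, assuming a two-sided, two-sample z-test with equal variances and `sigma` estimated from `x`; the function name and the per-group reading of `n` are my own choices, not given by the prompt:

```python
import math
import numpy as np
from scipy.stats import norm

def required_sample_size(x, alpha, power, effect_size):
    """Per-group n for a two-sided, two-sample z-test (equal variances assumed)."""
    sigma = np.std(x, ddof=1)          # estimate sigma from historical data
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for a two-sided test
    z_beta = norm.ppf(power)           # quantile matching the desired power
    n = 2 * (sigma * (z_alpha + z_beta) / effect_size) ** 2
    return math.ceil(n)                # round up to a whole number of samples
```

Rounding up (rather than to the nearest integer) guarantees the achieved power is at least the requested value.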
You are given three equal-length arrays:

- `period[i]`: time indicator (contains at least a "pre" and a "post" period; may contain multiple pre periods)
- `group[i]`: 0 = control, 1 = treatment
- `outcome[i]`: numeric outcome

and a numeric `threshold` for trend validation.

Definitions:

Parallel-trend / trend validation requirement:

Task: compute the difference-in-differences estimate, validating the parallel-trends requirement against `threshold`.
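A sketch of one way to structure this, under my own assumptions that every period label other than `"post"` is a pre period, that pre-period labels sort chronologically, and that the trend check compares treatment and control changes between consecutive pre periods:

```python
import numpy as np

def did_estimate(period, group, outcome, threshold):
    """Difference-in-differences with a simple pre-trend validation step."""
    period = np.asarray(period)
    group = np.asarray(group)
    outcome = np.asarray(outcome, dtype=float)

    def cell_mean(p, g):
        return outcome[(period == p) & (group == g)].mean()

    # Everything that is not "post" is treated as a pre period (assumption).
    pre_periods = [p for p in sorted(set(period)) if p != "post"]

    # Pre-trend check: treatment and control changes between consecutive
    # pre periods must not differ by more than `threshold`.
    for a, b in zip(pre_periods, pre_periods[1:]):
        gap = abs((cell_mean(b, 1) - cell_mean(a, 1))
                  - (cell_mean(b, 0) - cell_mean(a, 0)))
        if gap > threshold:
            raise ValueError("parallel-trends check failed")

    last_pre = pre_periods[-1]
    return ((cell_mean("post", 1) - cell_mean(last_pre, 1))
            - (cell_mean("post", 0) - cell_mean(last_pre, 0)))
```

Raising an exception on a failed check is one design choice; returning a sentinel value or a `(estimate, valid)` pair would also fit the prompt.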
You are given probabilities (as floats) describing an event A and evidence B, such as:

- `p_A` = P(A)
- `p_B_given_A` = P(B | A)
- `p_B_given_not_A` = P(B | not A)

Task: compute the posterior probability P(A | B).
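This is a direct application of Bayes' theorem, using the law of total probability to obtain P(B) in the denominator; the function name is mine:

```python
def posterior(p_A, p_B_given_A, p_B_given_not_A):
    """P(A | B) via Bayes' theorem."""
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)
    return p_B_given_A * p_A / p_B
```

With a rare event (`p_A = 0.01`), a sensitive test (`p_B_given_A = 0.99`), and a 5% false-positive rate, the posterior is only 1/6 — the classic base-rate result.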
You are given:

- `X`: a 2D array where each row corresponds to one feature and each column corresponds to one observation (shape: `num_features × num_samples`)
- `y`: binary outcome array of length `num_samples` (values in {0,1})
- `feature_names`: array of length `num_features`

Task: fit a model to predict `y` from `X` (include an intercept) and report feature importances using `feature_names`.
Notes:
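A sketch of one interpretable approach, assuming scikit-learn is available: logistic regression with standardized features, ranking by absolute coefficient. The transpose is needed because the prompt's `X` is `num_features × num_samples` while scikit-learn expects samples in rows; standardization (which assumes no constant features) makes coefficient magnitudes comparable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def feature_importance(X, y, feature_names):
    """Fit logistic regression (with intercept) and rank features by |coef|."""
    Xt = np.asarray(X, dtype=float).T             # samples x features
    Xt = (Xt - Xt.mean(axis=0)) / Xt.std(axis=0)  # z-score each feature
    model = LogisticRegression(fit_intercept=True).fit(Xt, y)
    coefs = model.coef_.ravel()
    order = np.argsort(-np.abs(coefs))            # most important first
    return [(feature_names[i], coefs[i]) for i in order]
```

The signed coefficients preserve direction of effect, which tree-based importances would lose.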