Implement Python Function for Statistical Test P-Values
Company: Roblox
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
##### Scenario
You need a utility that calculates p-values for one-sided and two-sided statistical tests.
##### Question
Write a Python function `compute_p_value(stat, dist='z', df=None, alternative='two-sided')` that returns the p-value. Your code should support Z-tests and Student-t tests, and handle 'less', 'greater', and 'two-sided' alternatives.
##### Hints
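Use the CDF of the chosen distribution; for two-sided tests return 2*min(CDF, 1-CDF). Libraries like scipy.stats are allowed for the open-ended question above, but the detailed problem statement and constraints below restrict the solution to the Python standard library.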
Quick Answer: This question evaluates proficiency in statistical hypothesis testing, p-value interpretation, and implementing distribution-based calculations (Z and Student-t) in code, and it falls under the Coding & Algorithms domain for data scientist roles.
Implement `compute_p_value(stat, dist='z', df=None, alternative='two-sided')` that returns the p-value for one-sided and two-sided tests. For dist='z', use the standard normal distribution. For dist='t', use the Student's t distribution with degrees of freedom df (a positive integer). The alternative can be 'less', 'greater', or 'two-sided'. Compute p-values using the CDF of the chosen distribution:
- for 'less', return CDF(stat)
- for 'greater', return 1 - CDF(stat)
- for 'two-sided', return 2 * min(CDF(stat), 1 - CDF(stat))
Do not use external libraries. Return a float in [0, 1].
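The rule for turning a CDF into a p-value does not depend on the distribution, so it can be sketched separately from the Z and t details. The helper name `p_value_from_cdf` is hypothetical, and `statistics.NormalDist` is used here only as a convenient standard-library CDF for the demonstration; the problem itself does not name it.

```python
from statistics import NormalDist


def p_value_from_cdf(cdf, stat, alternative="two-sided"):
    """Map a CDF callable and a test statistic to a p-value.

    `cdf` is any function returning P(X <= x) for the null distribution.
    """
    lower = cdf(stat)        # P(X <= stat)
    upper = 1.0 - lower      # P(X > stat) for a continuous distribution
    if alternative == "less":
        return lower
    if alternative == "greater":
        return upper
    if alternative == "two-sided":
        return 2.0 * min(lower, upper)
    raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")


# Demonstration with the standard normal CDF (assumed choice for this sketch).
z_cdf = NormalDist().cdf
print(p_value_from_cdf(z_cdf, 0.0))            # 1.0, since CDF(0) = 0.5
print(p_value_from_cdf(z_cdf, -1.96, "less"))  # close to 0.025
```

Passing the CDF in as a callable keeps the three-way branch in one place; a Student-t CDF could be supplied the same way.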
##### Constraints
- dist is 'z' or 't'
- alternative is 'less', 'greater', or 'two-sided'
- For dist='t', df is a positive integer (1 <= df <= 10^6)
- stat is a finite float (|stat| <= 1e6)
- Use only the Python standard library
- Return value within absolute error 1e-9 of the true p-value
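Because the accuracy requirement is stated as an absolute error, computing an upper tail as 1 - CDF is acceptable here even though it rounds to zero far out in the tail. When relative accuracy in the tail matters, one common alternative is the complementary error function. The sketch below contrasts the two for the Z case; the helper names `norm_cdf` and `norm_sf` are hypothetical.

```python
import math


def norm_cdf(z):
    """Standard normal CDF via erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_sf(z):
    """Upper tail P(Z > z) via erfc, avoiding the subtraction 1 - CDF."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


z = 10.0
naive = 1.0 - norm_cdf(z)  # erf saturates at 1.0, so this is exactly 0.0
direct = norm_sf(z)        # tiny but nonzero, about 7.6e-24

print(naive, direct)
print(abs(naive - direct) < 1e-9)  # True: both satisfy the absolute tolerance
```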
##### Examples
Input:
Expected Output: 1.0
Input:
Expected Output: 0.75
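Both expected outputs can be reproduced from closed forms. The inputs in the sketch below are hypothetical, chosen only because they yield 1.0 and 0.75; they are not claimed to be the original test inputs. It relies on two facts: the standard normal CDF at 0 is 0.5, and the Student-t distribution with df=1 is the standard Cauchy, whose CDF is 0.5 + atan(t)/pi.

```python
import math

# Hypothetical input 1: stat=0.0, dist='z', alternative='two-sided'.
cdf_z = 0.5 * (1.0 + math.erf(0.0 / math.sqrt(2.0)))  # CDF(0) = 0.5
p1 = 2.0 * min(cdf_z, 1.0 - cdf_z)

# Hypothetical input 2: stat=1.0, dist='t', df=1, alternative='less'.
# Student-t with df=1 is the standard Cauchy distribution.
cdf_t1 = 0.5 + math.atan(1.0) / math.pi
p2 = cdf_t1

print(p1)             # 1.0
print(round(p2, 12))  # 0.75
```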
##### Solution
def compute_p_value(stat, dist='z', df=None, alternative='two-sided'):
    """Return the p-value of `stat` under a standard normal or Student-t null."""
    import math

    def _norm_cdf(z):
        # Standard normal CDF via the error function.
        if z == math.inf:
            return 1.0
        if z == -math.inf:
            return 0.0
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    def _betacf(a, b, x):
        # Continued fraction for the incomplete beta function,
        # evaluated with the modified Lentz method.
        MAXIT = 200      # iteration cap
        EPS = 3e-14      # convergence tolerance
        FPMIN = 1e-300   # guard against division by zero
        qab = a + b
        qap = a + 1.0
        qam = a - 1.0
        c = 1.0
        d = 1.0 - qab * x / qap
        if abs(d) < FPMIN:
            d = FPMIN
        d = 1.0 / d
        h = d
        for m in range(1, MAXIT + 1):
            m2 = 2 * m
            # Even step of the recurrence.
            aa = m * (b - m) * x / ((qam + m2) * (a + m2))
            d = 1.0 + aa * d
            if abs(d) < FPMIN:
                d = FPMIN
            c = 1.0 + aa / c
            if abs(c) < FPMIN:
                c = FPMIN
            d = 1.0 / d
            h *= d * c
            # Odd step of the recurrence.
            aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
            d = 1.0 + aa * d
            if abs(d) < FPMIN:
                d = FPMIN
            c = 1.0 + aa / c
            if abs(c) < FPMIN:
                c = FPMIN
            d = 1.0 / d
            delh = d * c
            h *= delh
            if abs(delh - 1.0) < EPS:
                break
        return h

    def _betainc_reg(a, b, x):
        # Regularized incomplete beta function I_x(a, b).
        if x <= 0.0:
            return 0.0
        if x >= 1.0:
            return 1.0
        ln_bt = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
        bt = math.exp(ln_bt)
        # Use the symmetry I_x(a, b) = 1 - I_{1-x}(b, a) so the continued
        # fraction is evaluated where it converges quickly.
        if x < (a + 1.0) / (a + b + 2.0):
            return bt * _betacf(a, b, x) / a
        else:
            return 1.0 - bt * _betacf(b, a, 1.0 - x) / b

    def _t_cdf(t, nu):
        # Student-t CDF: I_x(nu/2, 1/2) with x = nu / (nu + t^2)
        # is the two-sided tail probability P(|T| > |t|).
        if not math.isfinite(t):
            return 1.0 if t > 0 else 0.0
        x = nu / (nu + t * t)
        a = nu / 2.0
        b = 0.5
        ib = _betainc_reg(a, b, x)
        if t >= 0:
            return 1.0 - 0.5 * ib
        else:
            return 0.5 * ib

    # Normalize and validate the arguments.
    d = (dist or 'z').lower()
    alt = (alternative or 'two-sided').lower().replace('_', '-')
    if d not in ('z', 't'):
        raise ValueError('dist must be "z" or "t"')
    if alt not in ('less', 'greater', 'two-sided'):
        raise ValueError('alternative must be "less", "greater", or "two-sided"')

    # F is the CDF of the null distribution evaluated at stat.
    if d == 'z':
        F = _norm_cdf(float(stat))
    else:
        if df is None or int(df) != df or int(df) <= 0:
            raise ValueError('df must be a positive integer for t distribution')
        F = _t_cdf(float(stat), int(df))

    if alt == 'less':
        p = F
    elif alt == 'greater':
        p = 1.0 - F
    else:
        # Equivalent to 2 * min(F, 1 - F).
        p = 2.0 * (F if F < 0.5 else 1.0 - F)

    # Clamp to [0, 1] to absorb rounding error.
    if p < 0.0:
        p = 0.0
    elif p > 1.0:
        p = 1.0
    return p
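
One way to sanity-check a Student-t CDF without external libraries is to compare it against independently computed values. The sketch below integrates the t density with composite Simpson's rule and compares the result with the closed forms known for df=1 and df=2. All helper names (`t_pdf`, `t_cdf_simpson`, `t_cdf_df1`, `t_cdf_df2`) are hypothetical, and this is meant as a test aid under those assumptions, not a replacement for the incomplete-beta approach.

```python
import math


def t_pdf(x, nu):
    """Student-t density with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2.0) - math.lgamma(nu / 2.0)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2.0 * math.log1p(x * x / nu))


def t_cdf_simpson(t, nu, n=2000):
    """CDF(t) = 0.5 +/- integral of the density over [0, |t|] (n must be even)."""
    if t == 0.0:
        return 0.5
    upper = abs(t)
    h = upper / n
    total = t_pdf(0.0, nu) + t_pdf(upper, nu)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    area = total * h / 3.0
    return 0.5 + area if t > 0 else 0.5 - area


def t_cdf_df1(t):
    """Closed form for df=1 (standard Cauchy)."""
    return 0.5 + math.atan(t) / math.pi


def t_cdf_df2(t):
    """Closed form for df=2."""
    return 0.5 + t / (2.0 * math.sqrt(2.0 + t * t))


for t in (-2.0, 0.5, 1.0):
    print(t, abs(t_cdf_simpson(t, 1) - t_cdf_df1(t)) < 1e-9,
          abs(t_cdf_simpson(t, 2) - t_cdf_df2(t)) < 1e-9)
```

The same reference values can be compared against `compute_p_value(t, dist='t', df=1, alternative='less')` and its df=2 counterpart when both are loaded in one session.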