Implement four DS coding tasks
Company: Roblox
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: easy
Interview Round: Take-home Project
Quick Answer: This multi-part question evaluates a data scientist's competencies in statistical inference (sample size and z-test calculations), causal inference and parallel-trends validation (difference-in-differences), Bayesian probability updating, and interpretable supervised learning feature importance, all framed as coding tasks.
Part 1: Two-Sample Z-Test Required Sample Size
Constraints
- 1 <= len(x) <= 10^5
- 0 < alpha < 1
- 0 < power < 1
- effect_size > 0
- Use the population standard deviation: sqrt(sum((xi - mean)^2) / len(x))
Examples
Input: ([10, 12, 14, 16], 0.05, 0.8, 2.0)
Expected Output: 20
Explanation: The population variance is 5, so sigma = sqrt(5). Plugging into the formula gives about 19.62, so the answer is 20.
Input: ([1, 2, 3, 4, 5], 0.1, 0.9, 1.5)
Expected Output: 16
Explanation: The population variance is 2. The computed n is about 15.22, so the minimum integer per-group size is 16.
Input: ([7], 0.05, 0.8, 0.5)
Expected Output: 1
Explanation: Edge case: a single historical value gives zero estimated variance. The implementation returns the smallest valid per-group size, 1.
Input: ([1, 1, 1, 2], 0.05, 0.8, 0.1)
Expected Output: 295
Explanation: A very small detectable effect size requires a much larger sample size.
Solution
def solution(x, alpha, power, effect_size):
import math
from statistics import NormalDist
mean_x = sum(x) / len(x)
variance = sum((v - mean_x) ** 2 for v in x) / len(x)
sigma = math.sqrt(variance)
if sigma == 0:
return 1
z_alpha = NormalDist().inv_cdf(1 - alpha / 2.0)
z_power = NormalDist().inv_cdf(power)
n = 2.0 * (sigma ** 2) * (z_alpha + z_power) ** 2 / (effect_size ** 2)
return max(1, math.ceil(n))
Time complexity: O(n). Space complexity: O(1).
Hints
- For equal-sized groups in a two-sample z-test, the standard formula is n = 2 * sigma^2 * (z_(1-alpha/2) + z_power)^2 / effect_size^2.
- After computing the real-valued sample size, take the ceiling because you need the minimum integer that still satisfies the requirement.
Part 2: Difference-in-Differences with Pre-Trend Validation
Constraints
- len(period) == len(group) == len(outcome)
- 4 <= len(period) <= 10^5
- group[i] is either 0 or 1
- Each group has at least one observation in the overall pre bucket and the overall post bucket
- For every distinct pre period used in trend validation, both groups appear at least once
Examples
Input: (['pre', 'pre', 'post', 'post'], [0, 1, 0, 1], [10, 12, 11, 15], 0.5)
Expected Output: (2.0, True)
Explanation: Single pre period, so trend validation automatically passes. DiD = (15 - 12) - (11 - 10) = 2.
Input: (['pre1', 'pre1', 'pre2', 'pre2', 'post', 'post'], [0, 1, 0, 1, 0, 1], [10, 12, 11, 13, 12, 17], 0.1)
Expected Output: (3.0, True)
Explanation: Pre-period differences are 2 and 2, so the maximum change is 0 <= 0.1. The DiD estimate is 3.0.
Input: (['pre1', 'pre1', 'pre2', 'pre2', 'post', 'post'], [0, 1, 0, 1, 0, 1], [10, 12, 10, 15, 12, 18], 2.0)
Expected Output: (2.5, False)
Explanation: Pre-period differences are 2 and 5, so the change is 3, which exceeds the threshold 2.0.
Input: (['pre1', 'pre1', 'pre1', 'pre1', 'pre2', 'pre2', 'pre2', 'pre2', 'post', 'post', 'post', 'post'], [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1], [4, 6, 5, 7, 5, 5, 6, 8, 6, 6, 9, 11], 1.0)
Expected Output: (2.5, True)
Explanation: Edge case with multiple observations per cell. The pre-trend change is exactly 1.0, so it passes.
Solution
def solution(period, group, outcome, threshold):
import re
from collections import defaultdict
def bucket(label):
s = str(label).lower()
if s.startswith('pre'):
return 'pre'
return 'post'
def pre_order(label):
s = str(label).lower()
match = re.search(r'(-?\d+)$', s)
num = int(match.group(1)) if match else 0
return (num, s)
overall_sum = defaultdict(float)
overall_count = defaultdict(int)
cell_sum = defaultdict(float)
cell_count = defaultdict(int)
pre_labels = set()
for t, g, y in zip(period, group, outcome):
b = bucket(t)
overall_sum[(g, b)] += y
overall_count[(g, b)] += 1
if b == 'pre':
cell_sum[(g, t)] += y
cell_count[(g, t)] += 1
pre_labels.add(t)
mean_treat_post = overall_sum[(1, 'post')] / overall_count[(1, 'post')]
mean_treat_pre = overall_sum[(1, 'pre')] / overall_count[(1, 'pre')]
mean_ctrl_post = overall_sum[(0, 'post')] / overall_count[(0, 'post')]
mean_ctrl_pre = overall_sum[(0, 'pre')] / overall_count[(0, 'pre')]
did = (mean_treat_post - mean_treat_pre) - (mean_ctrl_post - mean_ctrl_pre)
ordered_pre = sorted(pre_labels, key=pre_order)
if len(ordered_pre) <= 1:
return (did, True)
diffs = []
for t in ordered_pre:
treat_mean = cell_sum[(1, t)] / cell_count[(1, t)]
ctrl_mean = cell_sum[(0, t)] / cell_count[(0, t)]
diffs.append(treat_mean - ctrl_mean)
max_change = 0.0
for i in range(1, len(diffs)):
max_change = max(max_change, abs(diffs[i] - diffs[i - 1]))
return (did, max_change <= threshold + 1e-12)
Time complexity: O(n + k log k), where k is the number of distinct pre periods. Space complexity: O(k).
Hints
- First compute overall means for treatment/control in the combined pre and combined post buckets to get the DiD estimate.
- For trend validation, compute treatment minus control separately for each distinct pre period, sort the pre periods by time order, then look at consecutive changes.
Part 3: Bayes' Rule Posterior Probability
Constraints
- 0 <= p_A <= 1
- 0 <= p_B_given_A <= 1
- 0 <= p_B_given_not_A <= 1
- p_B_given_A * p_A + p_B_given_not_A * (1 - p_A) > 0
Examples
Input: (0.01, 0.9, 0.05)
Expected Output: 0.15384615384615385
Explanation: A standard rare-event example: even with strong evidence, the base rate matters.
Input: (0.3, 0.8, 0.2)
Expected Output: 0.631578947368421
Explanation: Numerator = 0.24 and denominator = 0.38.
Input: (0.0, 0.9, 0.1)
Expected Output: 0.0
Explanation: Edge case: if P(A) = 0, then P(A|B) must also be 0.
Input: (1.0, 0.7, 0.2)
Expected Output: 1.0
Explanation: Edge case: if P(A) = 1, then P(A|B) must also be 1.
Solution
def solution(p_A, p_B_given_A, p_B_given_not_A):
numerator = p_B_given_A * p_A
denominator = numerator + p_B_given_not_A * (1.0 - p_A)
return numerator / denominator
Time complexity: O(1). Space complexity: O(1).
Hints
- Start from Bayes' rule: P(A|B) = P(B|A)P(A) / P(B).
- Compute P(B) by splitting on whether A happens or not.
Part 4: Logistic Regression Top-3 Features
Constraints
- 3 <= number of features <= 20
- 1 <= number of samples <= 500
- len(feature_names) == number of features
- Each row in X has length equal to len(y)
- y contains only 0 and 1
- y contains at least one 0 and at least one 1
Examples
Input: ([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]], [0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1], ['A', 'B', 'C', 'D'])
Expected Output: ['A', 'B', 'C']
Explanation: The grouped synthetic data is built so feature A has the strongest positive effect, followed by B, then C.
Input: ([[0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]], [0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1], ['beta', 'alpha', 'gamma', 'delta'])
Expected Output: ['alpha', 'beta', 'gamma']
Explanation: Alpha and beta are tied in strength, so the tie is broken alphabetically by feature name.
Input: ([[0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]], [0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1], ['x1', 'x2', 'x3'])
Expected Output: ['x1', 'x2', 'x3']
Explanation: Edge case: there are exactly 3 features, so all of them are returned after ranking.
Solution
def solution(X, y, feature_names):
import math
num_features = len(X)
num_samples = len(y)
def solve_linear(A, b):
n = len(b)
M = [A[i][:] + [b[i]] for i in range(n)]
for col in range(n):
pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
M[col], M[pivot] = M[pivot], M[col]
if abs(M[col][col]) < 1e-12:
M[col][col] = 1e-12
pivot_value = M[col][col]
for j in range(col, n + 1):
M[col][j] /= pivot_value
for r in range(n):
if r == col:
continue
factor = M[r][col]
if factor == 0:
continue
for j in range(col, n + 1):
M[r][j] -= factor * M[col][j]
return [M[i][n] for i in range(n)]
def sigmoid(z):
if z >= 0:
e = math.exp(-z)
return 1.0 / (1.0 + e)
e = math.exp(z)
return e / (1.0 + e)
means = []
stds = []
for i in range(num_features):
col = [float(v) for v in X[i]]
mean = sum(col) / num_samples
var = sum((v - mean) ** 2 for v in col) / num_samples
std = math.sqrt(var)
if std == 0:
std = 1.0
means.append(mean)
stds.append(std)
Z = []
for j in range(num_samples):
row = [1.0]
for i in range(num_features):
row.append((float(X[i][j]) - means[i]) / stds[i])
Z.append(row)
beta = [0.0] * (num_features + 1)
lam = 1e-6
for _ in range(40):
probs = []
for row in Z:
z = sum(beta[k] * row[k] for k in range(num_features + 1))
probs.append(sigmoid(z))
grad = [0.0] * (num_features + 1)
H = [[0.0] * (num_features + 1) for _ in range(num_features + 1)]
for idx, row in enumerate(Z):
p = probs[idx]
diff = p - float(y[idx])
w = p * (1.0 - p)
for a in range(num_features + 1):
grad[a] += diff * row[a]
for a in range(num_features + 1):
ra = row[a]
for b in range(a, num_features + 1):
H[a][b] += w * ra * row[b]
for a in range(num_features + 1):
for b in range(a):
H[a][b] = H[b][a]
for a in range(1, num_features + 1):
grad[a] += lam * beta[a]
H[a][a] += lam
delta = solve_linear(H, grad)
max_change = 0.0
for i in range(num_features + 1):
beta[i] -= delta[i]
max_change = max(max_change, abs(delta[i]))
if max_change < 1e-8:
break
coeffs = []
for i in range(num_features):
coeffs.append(beta[i + 1] / stds[i])
ranked = sorted(
range(num_features),
key=lambda i: (-round(abs(coeffs[i]), 12), feature_names[i])
)
return [feature_names[i] for i in ranked[:3]]
Time complexity: O(T * (n * d^2 + d^3)), where n is the number of samples, d is the number of features + 1, and T is the number of Newton iterations. Space complexity: O(n * d + d^2).
Hints
- Because X is feature-major, you may want to conceptually transpose it so each training example becomes one row.
- A simple way to fit logistic regression without external libraries is Newton's method (IRLS) or gradient descent with an intercept term.