What difficulty level is this coding question?

This is an easy-difficulty Coding & Algorithms question, commonly asked during Take-home Project rounds at Roblox.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Roblox during technical interviews.

Implement four DS coding tasks | Roblox Coding Question

Quick Overview

This multi-part question evaluates a data scientist's competencies in statistical inference (sample size and z-test calculations), causal inference and parallel-trends validation (difference-in-differences), Bayesian probability updating, and interpretable supervised learning feature importance, all framed as coding tasks.

Implement four DS coding tasks

Company: Roblox

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: easy

Interview Round: Take-home Project

You are completing a CodeSignal-style assessment (Python or R). Implement solutions for the following four independent questions.

## 1) Two-sample z-test: required sample size

You are given:

- `x`: numeric array of historical observations for the metric (use it to estimate the metric standard deviation `sigma`)
- `alpha`: significance level (e.g., 0.05)
- `power`: desired power (e.g., 0.8)
- `effect_size`: the minimum detectable absolute difference in means, \(\Delta\)

Assumptions:

- Two-sided **two-sample z-test** for a difference in means.
- Treatment and control have **equal** sample size \(n\).
- Use \(\hat\sigma = \text{std}(x)\) as the population standard deviation estimate.

Task:

- Return the **minimum integer per-group sample size** `n` required to detect `effect_size` at level `alpha` with `power`.

## 2) Difference-in-Differences (DiD) + parallel-trend validation

You are given three equal-length arrays:

- `period[i]`: time indicator (contains at least a “pre” and a “post” period; may contain multiple pre periods)
- `group[i]`: 0 = control, 1 = treatment
- `outcome[i]`: numeric outcome

And a numeric `threshold` for trend validation.

Definitions:

- Let \(\bar{Y}_{g,t}\) be the mean outcome for group \(g\in\{0,1\}\) in period \(t\).
- The DiD estimate is:

\[
\text{DiD} = (\bar{Y}_{1,post}-\bar{Y}_{1,pre}) - (\bar{Y}_{0,post}-\bar{Y}_{0,pre}).
\]

Parallel-trend / trend validation requirement:

- If there are **multiple pre periods**, compute the group difference \(d_t = \bar{Y}_{1,t} - \bar{Y}_{0,t}\) for each pre period \(t\), sort pre periods by time, and validate:

\[
\max_t |d_{t} - d_{t-1}| \le \text{threshold}.
\]

- If there is only a single pre period, treat trend validation as passing.

Task:

- Return (a) the DiD estimate and (b) whether the pre-trend validation passes under the `threshold`.

## 3) Bayes’ rule posterior probability

You are given probabilities (as floats) describing an event \(A\) and evidence \(B\), such as:

- `p_A` = \(P(A)\)
- `p_B_given_A` = \(P(B\mid A)\)
- `p_B_given_not_A` = \(P(B\mid \neg A)\)

Task:

- Compute and return the posterior probability \(P(A\mid B)\).

## 4) Logistic regression: top-3 features

You are given:

- `X`: a 2D array where each **row corresponds to one feature** and each **column corresponds to one observation** (shape: `num_features × num_samples`)
- `y`: binary outcome array of length `num_samples` (values in {0,1})
- `feature_names`: array of length `num_features`

Task:

- Fit a logistic regression model to predict `y` from `X` (include an intercept).
- Rank features by **absolute value of their fitted coefficient** (exclude the intercept).
- Return the **names of the top 3 features** in descending order of importance.

Notes:

- Handle ties deterministically (e.g., break ties by feature name ascending).
- Assume inputs are well-formed and numeric.

Part 1: Two-Sample Z-Test Required Sample Size

You are given historical observations x for a metric. Estimate sigma as the population standard deviation of x (divide by len(x), not len(x) - 1). Then compute the minimum per-group sample size needed for a two-sided two-sample z-test with equal group sizes to detect a given absolute effect size at significance level alpha with the desired power.

Constraints

1 <= len(x) <= 10^5
0 < alpha < 1
0 < power < 1
effect_size > 0
Use the population standard deviation: sqrt(sum((xi - mean)^2) / len(x))

Examples

Input: ([10, 12, 14, 16], 0.05, 0.8, 2.0)

Expected Output: 20

Explanation: The population variance is 5, so sigma = sqrt(5). Plugging into the formula gives about 19.62, so the answer is 20.

Input: ([1, 2, 3, 4, 5], 0.1, 0.9, 1.5)

Expected Output: 16

Explanation: The population variance is 2. The computed n is about 15.22, so the minimum integer per-group size is 16.

Hints

For equal-sized groups in a two-sample z-test, the standard formula is n = 2 * sigma^2 * (z_(1-alpha/2) + z_power)^2 / effect_size^2, where z_q denotes the standard normal quantile at probability q.
After computing the real-valued sample size, take the ceiling because you need the minimum integer that still satisfies the requirement.
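
The formula and the ceiling step from the hints can be sketched in Python as follows. This is a minimal sketch using only the standard library; the function name `required_sample_size` is a hypothetical choice for illustration, not part of the original problem statement.

```python
import math
from statistics import NormalDist, pstdev


def required_sample_size(x, alpha, power, effect_size):
    """Per-group n for a two-sided, equal-allocation two-sample z-test.

    Hypothetical helper name; the problem statement does not fix a signature.
    """
    # Population standard deviation (divides by len(x)), as the constraints require.
    sigma = pstdev(x)

    # Standard normal quantiles for the two-sided level and the target power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)

    # n = 2 * sigma^2 * (z_(1-alpha/2) + z_power)^2 / effect_size^2
    n_real = 2 * sigma ** 2 * (z_alpha + z_power) ** 2 / effect_size ** 2

    # Round up: the smallest integer that still meets the power requirement.
    return math.ceil(n_real)
```

On the two examples above, `required_sample_size([10, 12, 14, 16], 0.05, 0.8, 2.0)` returns 20 and `required_sample_size([1, 2, 3, 4, 5], 0.1, 0.9, 1.5)` returns 16.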

Part 2: Difference-in-Differences with Pre-Trend Validation

You are given arrays period, group, and outcome. Period labels are strings that start with 'pre' or 'post' (for example: 'pre', 'pre1', 'pre2', 'post'). Compute the Difference-in-Differences estimate using the overall mean across all pre rows and all post rows. If there are multiple distinct pre periods, also validate the parallel-trend assumption by checking whether the largest change in treatment-control difference between consecutive pre periods is at most the given threshold.

Constraints

len(period) == len(group) == len(outcome)
4 <= len(period) <= 10^5
group[i] is either 0 or 1
Each group has at least one observation in the overall pre bucket and the overall post bucket
For every distinct pre period used in trend validation, both groups appear at least once

Examples

Input: (['pre', 'pre', 'post', 'post'], [0, 1, 0, 1], [10, 12, 11, 15], 0.5)

Expected Output: (2.0, True)

Explanation: Single pre period, so trend validation automatically passes. DiD = (15 - 12) - (11 - 10) = 2.

Input: (['pre1', 'pre1', 'pre2', 'pre2', 'post', 'post'], [0, 1, 0, 1, 0, 1], [10, 12, 11, 13, 12, 17], 0.1)

Expected Output: (3.0, True)

Explanation: Pre-period differences are 2 and 2, so the maximum change is 0 <= 0.1. The DiD estimate is 3.0.

Hints

First compute overall means for treatment/control in the combined pre and combined post buckets to get the DiD estimate.
For trend validation, compute treatment minus control separately for each distinct pre period, sort the pre periods by time order, then look at consecutive changes.
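
The two hints above can be combined into one pass over the data, sketched below. The function name `did_with_pretrend` is hypothetical. The problem does not say how period labels encode time, so this sketch assumes pre labels are ordered by their trailing integer suffix (`'pre1'` before `'pre2'` before `'pre10'`), with a bare `'pre'` treated as suffix 0.

```python
from collections import defaultdict


def _mean(values):
    return sum(values) / len(values)


def did_with_pretrend(period, group, outcome, threshold):
    """Return (DiD estimate, pre-trend check passed). Hypothetical helper name."""
    pooled = defaultdict(list)                         # (group, 'pre'/'post') -> outcomes
    pre_by_label = defaultdict(lambda: {0: [], 1: []})  # pre label -> group -> outcomes

    for p, g, y in zip(period, group, outcome):
        phase = "pre" if p.startswith("pre") else "post"
        pooled[(g, phase)].append(y)
        if phase == "pre":
            pre_by_label[p][g].append(y)

    # DiD on means pooled over all pre rows and all post rows.
    did = ((_mean(pooled[(1, "post")]) - _mean(pooled[(1, "pre")]))
           - (_mean(pooled[(0, "post")]) - _mean(pooled[(0, "pre")])))

    # ASSUMPTION: time order of pre periods is given by the integer suffix.
    def order_key(label):
        suffix = label[len("pre"):]
        return (int(suffix) if suffix.isdecimal() else 0, label)

    labels = sorted(pre_by_label, key=order_key)

    # Treatment-minus-control difference d_t for each distinct pre period.
    diffs = [_mean(pre_by_label[l][1]) - _mean(pre_by_label[l][0]) for l in labels]

    # Largest change between consecutive pre periods; 0 when there is only one.
    max_change = max((abs(b - a) for a, b in zip(diffs, diffs[1:])), default=0.0)

    return float(did), max_change <= threshold
```

With a single pre period the list of consecutive changes is empty, so `max(..., default=0.0)` makes the validation pass, matching the rule in the problem statement.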

Part 3: Bayes' Rule Posterior Probability

Given P(A), P(B|A), and P(B|not A), compute the posterior probability P(A|B) using Bayes' rule.

Constraints

0 <= p_A <= 1
0 <= p_B_given_A <= 1
0 <= p_B_given_not_A <= 1
p_B_given_A * p_A + p_B_given_not_A * (1 - p_A) > 0

Examples

Input: (0.01, 0.9, 0.05)

Expected Output: 0.15384615384615385

Explanation: A standard rare-event example: even with strong evidence, the base rate matters.

Input: (0.3, 0.8, 0.2)

Expected Output: 0.631578947368421

Explanation: Numerator = 0.24 and denominator = 0.38.

Hints

Start from Bayes' rule: P(A|B) = P(B|A)P(A) / P(B).
Compute P(B) by splitting on whether A happens or not.
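
As a minimal sketch of these two steps, the function below expands P(B) with the law of total probability and then applies Bayes' rule. The name `bayes_posterior` is a hypothetical choice, and the explicit check on P(B) simply mirrors the last constraint listed above.

```python
def bayes_posterior(p_A, p_B_given_A, p_B_given_not_A):
    """Return P(A | B). Hypothetical helper name."""
    # Numerator of Bayes' rule: P(B | A) * P(A).
    numerator = p_B_given_A * p_A

    # Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A).
    p_B = numerator + p_B_given_not_A * (1 - p_A)

    # The constraints guarantee P(B) > 0; fail loudly if a caller violates that.
    if p_B <= 0:
        raise ValueError("P(B) must be positive for the posterior to be defined")

    return numerator / p_B
```

For the second example, the numerator is 0.24 and P(B) is 0.38, so the result is approximately 0.6316.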

Part 4: Logistic Regression Top-3 Features

You are given a feature matrix X where each row is a feature and each column is an observation, a binary target array y, and the corresponding feature names. Fit a logistic regression model with an intercept, rank features by the absolute value of their fitted coefficients, exclude the intercept, and return the names of the top 3 features in descending order of importance. If two features have the same absolute coefficient, break ties by feature name in ascending alphabetical order.

Constraints

3 <= number of features <= 20
1 <= number of samples <= 500
len(feature_names) == number of features
Each row in X has length equal to len(y)
y contains only 0 and 1
y contains at least one 0 and at least one 1

Examples

Input: ([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]], [0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1], ['A', 'B', 'C', 'D'])

Expected Output: ['A', 'B', 'C']

Explanation: The grouped synthetic data is built so feature A has the strongest positive effect, followed by B, then C.

Input: ([[0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]], [0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1], ['beta', 'alpha', 'gamma', 'delta'])

Expected Output: ['alpha', 'beta', 'gamma']

Explanation: Alpha and beta are tied in strength, so the tie is broken alphabetically by feature name.

Hints

Because X is feature-major, you may want to conceptually transpose it so each training example becomes one row.
A simple way to fit logistic regression without external libraries is Newton's method (IRLS) or gradient descent with an intercept term.
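
One way to follow these hints without external libraries is sketched below: transpose `X`, prepend an intercept column, and run Newton's method on the log-likelihood. The names `top3_features` and `_solve` are hypothetical. Two assumptions go beyond the problem statement: coefficients are rounded to 6 decimals before ranking so that ties which are exact in theory are not broken by floating-point noise, and the data is assumed to be neither perfectly separable nor collinear (otherwise the unpenalized fit diverges or the linear solve fails).

```python
import math


def _solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]  # fails if the Hessian is singular
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def top3_features(X, y, feature_names, iters=50, tol=1e-10, decimals=6):
    """Top-3 feature names by |coefficient|. Hypothetical helper name."""
    # X is feature-major: zip(*X) yields one observation per row; add intercept.
    rows = [[1.0] + [float(v) for v in obs] for obs in zip(*X)]
    p = len(rows[0])
    beta = [0.0] * p

    for _ in range(iters):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for r, yi in zip(rows, y):
            eta = sum(b * v for b, v in zip(beta, r))
            eta = max(-30.0, min(30.0, eta))  # guard against exp overflow
            mu = 1.0 / (1.0 + math.exp(-eta))
            w = mu * (1.0 - mu)
            for j in range(p):
                grad[j] += (yi - mu) * r[j]
                for k in range(p):
                    hess[j][k] += w * r[j] * r[k]
        # Newton step: beta <- beta + H^{-1} * gradient of the log-likelihood.
        step = _solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break

    # ASSUMPTION: round before ranking so near-equal coefficients count as tied;
    # ties are then broken by feature name ascending. beta[0] is the intercept.
    ranked = sorted(zip(feature_names, beta[1:]),
                    key=lambda t: (-round(abs(t[1]), decimals), t[0]))
    return [name for name, _ in ranked[:3]]
```

Sorting on the pair `(-rounded |coefficient|, name)` gives descending importance with alphabetical tie-breaking in a single pass, which is what the second example relies on for `alpha` and `beta`.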
