PracHub
QuestionsPremiumLearningGuidesInterview PrepCoaches

Quick Overview

This multi-part question evaluates a data scientist's competencies in statistical inference (sample size and z-test calculations), causal inference and parallel-trends validation (difference-in-differences), Bayesian probability updating, and interpretable supervised learning feature importance, all framed as coding tasks.

  • easy
  • Roblox
  • Coding & Algorithms
  • Data Scientist

Implement four DS coding tasks

Company: Roblox

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: easy

Interview Round: Take-home Project

You are completing a CodeSignal-style assessment (Python or R). Implement solutions for the following four independent questions. ## 1) Two-sample z-test: required sample size You are given: - `x`: numeric array of historical observations for the metric (use it to estimate the metric standard deviation `sigma`) - `alpha`: significance level (e.g., 0.05) - `power`: desired power (e.g., 0.8) - `effect_size`: the minimum detectable absolute difference in means, \(\Delta\) Assumptions: - Two-sided **two-sample z-test** for a difference in means. - Treatment and control have **equal** sample size \(n\). - Use \(\hat\sigma = \text{std}(x)\) as the population standard deviation estimate. Task: - Return the **minimum integer per-group sample size** `n` required to detect `effect_size` at level `alpha` with `power`. ## 2) Difference-in-Differences (DiD) + parallel-trend validation You are given three equal-length arrays: - `period[i]`: time indicator (contains at least a “pre” and a “post” period; may contain multiple pre periods) - `group[i]`: 0 = control, 1 = treatment - `outcome[i]`: numeric outcome And a numeric `threshold` for trend validation. Definitions: - Let \(\bar{Y}_{g,t}\) be the mean outcome for group \(g\in\{0,1\}\) in period \(t\). - The DiD estimate is: \[ \text{DiD} = (\bar{Y}_{1,post}-\bar{Y}_{1,pre}) - (\bar{Y}_{0,post}-\bar{Y}_{0,pre}). \] Parallel-trend / trend validation requirement: - If there are **multiple pre periods**, compute the group difference \(d_t = \bar{Y}_{1,t} - \bar{Y}_{0,t}\) for each pre period \(t\), sort pre periods by time, and validate: \[ \max_t |d_{t} - d_{t-1}| \le \text{threshold}. \] - If there is only a single pre period, treat trend validation as passing. Task: - Return (a) the DiD estimate and (b) whether the pre-trend validation passes under the `threshold`. ## 3) Bayes’ rule posterior probability You are given probabilities (as floats) describing an event \(A\) and evidence \(B\), such as: - `p_A` = \(P(A)\) - `p_B_given_A` = \(P(B\mid A)\) - `p_B_given_not_A` = \(P(B\mid \neg A)\) Task: - Compute and return the posterior probability \(P(A\mid B)\). ## 4) Logistic regression: top-3 features You are given: - `X`: a 2D array where each **row corresponds to one feature** and each **column corresponds to one observation** (shape: `num_features × num_samples`) - `y`: binary outcome array of length `num_samples` (values in {0,1}) - `feature_names`: array of length `num_features` Task: - Fit a logistic regression model to predict `y` from `X` (include an intercept). - Rank features by **absolute value of their fitted coefficient** (exclude the intercept). - Return the **names of the top 3 features** in descending order of importance. Notes: - Handle ties deterministically (e.g., break ties by feature name ascending). - Assume inputs are well-formed and numeric.

Quick Answer: This multi-part question evaluates a data scientist's competencies in statistical inference (sample size and z-test calculations), causal inference and parallel-trends validation (difference-in-differences), Bayesian probability updating, and interpretable supervised learning feature importance, all framed as coding tasks.

Part 1: Two-Sample Z-Test Required Sample Size

You are given historical observations for a metric and must estimate its population standard deviation using the population standard deviation of x. Then compute the minimum per-group sample size needed for a two-sided two-sample z-test with equal group sizes to detect a given absolute effect size at significance level alpha and desired power.

Constraints

  • 1 <= len(x) <= 10^5
  • 0 < alpha < 1
  • 0 < power < 1
  • effect_size > 0
  • Use the population standard deviation: sqrt(sum((xi - mean)^2) / len(x))

Examples

Input: ([10, 12, 14, 16], 0.05, 0.8, 2.0)

Expected Output: 20

Explanation: The population variance is 5, so sigma = sqrt(5). Plugging into the formula gives about 19.62, so the answer is 20.

Input: ([1, 2, 3, 4, 5], 0.1, 0.9, 1.5)

Expected Output: 16

Explanation: The population variance is 2. The computed n is about 15.22, so the minimum integer per-group size is 16.

Input: ([7], 0.05, 0.8, 0.5)

Expected Output: 1

Explanation: Edge case: a single historical value gives zero estimated variance. The implementation returns the smallest valid per-group size, 1.

Input: ([1, 1, 1, 2], 0.05, 0.8, 0.1)

Expected Output: 295

Explanation: A very small detectable effect size requires a much larger sample size.

Solution

def solution(x, alpha, power, effect_size):
    import math
    from statistics import NormalDist

    mean_x = sum(x) / len(x)
    variance = sum((v - mean_x) ** 2 for v in x) / len(x)
    sigma = math.sqrt(variance)

    if sigma == 0:
        return 1

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)

    n = 2.0 * (sigma ** 2) * (z_alpha + z_power) ** 2 / (effect_size ** 2)
    return max(1, math.ceil(n))

Time complexity: O(n). Space complexity: O(1).

Hints

  1. For equal-sized groups in a two-sample z-test, the standard formula is n = 2 * sigma^2 * (z_(1-alpha/2) + z_power)^2 / effect_size^2.
  2. After computing the real-valued sample size, take the ceiling because you need the minimum integer that still satisfies the requirement.

Part 2: Difference-in-Differences with Pre-Trend Validation

You are given arrays period, group, and outcome. Period labels are strings that start with 'pre' or 'post' (for example: 'pre', 'pre1', 'pre2', 'post'). Compute the Difference-in-Differences estimate using the overall mean across all pre rows and all post rows. If there are multiple distinct pre periods, also validate the parallel-trend assumption by checking whether the largest change in treatment-control difference between consecutive pre periods is at most the given threshold.

Constraints

  • len(period) == len(group) == len(outcome)
  • 4 <= len(period) <= 10^5
  • group[i] is either 0 or 1
  • Each group has at least one observation in the overall pre bucket and the overall post bucket
  • For every distinct pre period used in trend validation, both groups appear at least once

Examples

Input: (['pre', 'pre', 'post', 'post'], [0, 1, 0, 1], [10, 12, 11, 15], 0.5)

Expected Output: (2.0, True)

Explanation: Single pre period, so trend validation automatically passes. DiD = (15 - 12) - (11 - 10) = 2.

Input: (['pre1', 'pre1', 'pre2', 'pre2', 'post', 'post'], [0, 1, 0, 1, 0, 1], [10, 12, 11, 13, 12, 17], 0.1)

Expected Output: (3.0, True)

Explanation: Pre-period differences are 2 and 2, so the maximum change is 0 <= 0.1. The DiD estimate is 3.0.

Input: (['pre1', 'pre1', 'pre2', 'pre2', 'post', 'post'], [0, 1, 0, 1, 0, 1], [10, 12, 10, 15, 12, 18], 2.0)

Expected Output: (2.5, False)

Explanation: Pre-period differences are 2 and 5, so the change is 3, which exceeds the threshold 2.0.

Input: (['pre1', 'pre1', 'pre1', 'pre1', 'pre2', 'pre2', 'pre2', 'pre2', 'post', 'post', 'post', 'post'], [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1], [4, 6, 5, 7, 5, 5, 6, 8, 6, 6, 9, 11], 1.0)

Expected Output: (2.5, True)

Explanation: Edge case with multiple observations per cell. The pre-trend change is exactly 1.0, so it passes.

Solution

def solution(period, group, outcome, threshold):
    import re
    from collections import defaultdict

    def bucket(label):
        s = str(label).lower()
        if s.startswith('pre'):
            return 'pre'
        return 'post'

    def pre_order(label):
        s = str(label).lower()
        match = re.search(r'(-?\d+)$', s)
        num = int(match.group(1)) if match else 0
        return (num, s)

    overall_sum = defaultdict(float)
    overall_count = defaultdict(int)
    cell_sum = defaultdict(float)
    cell_count = defaultdict(int)
    pre_labels = set()

    for t, g, y in zip(period, group, outcome):
        b = bucket(t)
        overall_sum[(g, b)] += y
        overall_count[(g, b)] += 1
        if b == 'pre':
            cell_sum[(g, t)] += y
            cell_count[(g, t)] += 1
            pre_labels.add(t)

    mean_treat_post = overall_sum[(1, 'post')] / overall_count[(1, 'post')]
    mean_treat_pre = overall_sum[(1, 'pre')] / overall_count[(1, 'pre')]
    mean_ctrl_post = overall_sum[(0, 'post')] / overall_count[(0, 'post')]
    mean_ctrl_pre = overall_sum[(0, 'pre')] / overall_count[(0, 'pre')]

    did = (mean_treat_post - mean_treat_pre) - (mean_ctrl_post - mean_ctrl_pre)

    ordered_pre = sorted(pre_labels, key=pre_order)
    if len(ordered_pre) <= 1:
        return (did, True)

    diffs = []
    for t in ordered_pre:
        treat_mean = cell_sum[(1, t)] / cell_count[(1, t)]
        ctrl_mean = cell_sum[(0, t)] / cell_count[(0, t)]
        diffs.append(treat_mean - ctrl_mean)

    max_change = 0.0
    for i in range(1, len(diffs)):
        max_change = max(max_change, abs(diffs[i] - diffs[i - 1]))

    return (did, max_change <= threshold + 1e-12)

Time complexity: O(n + k log k), where k is the number of distinct pre periods. Space complexity: O(k).

Hints

  1. First compute overall means for treatment/control in the combined pre and combined post buckets to get the DiD estimate.
  2. For trend validation, compute treatment minus control separately for each distinct pre period, sort the pre periods by time order, then look at consecutive changes.

Part 3: Bayes' Rule Posterior Probability

Given P(A), P(B|A), and P(B|not A), compute the posterior probability P(A|B) using Bayes' rule.

Constraints

  • 0 <= p_A <= 1
  • 0 <= p_B_given_A <= 1
  • 0 <= p_B_given_not_A <= 1
  • p_B_given_A * p_A + p_B_given_not_A * (1 - p_A) > 0

Examples

Input: (0.01, 0.9, 0.05)

Expected Output: 0.15384615384615385

Explanation: A standard rare-event example: even with strong evidence, the base rate matters.

Input: (0.3, 0.8, 0.2)

Expected Output: 0.631578947368421

Explanation: Numerator = 0.24 and denominator = 0.38.

Input: (0.0, 0.9, 0.1)

Expected Output: 0.0

Explanation: Edge case: if P(A) = 0, then P(A|B) must also be 0.

Input: (1.0, 0.7, 0.2)

Expected Output: 1.0

Explanation: Edge case: if P(A) = 1, then P(A|B) must also be 1.

Solution

def solution(p_A, p_B_given_A, p_B_given_not_A):
    numerator = p_B_given_A * p_A
    denominator = numerator + p_B_given_not_A * (1.0 - p_A)
    return numerator / denominator

Time complexity: O(1). Space complexity: O(1).

Hints

  1. Start from Bayes' rule: P(A|B) = P(B|A)P(A) / P(B).
  2. Compute P(B) by splitting on whether A happens or not.

Part 4: Logistic Regression Top-3 Features

You are given a feature matrix X where each row is a feature and each column is an observation, a binary target array y, and the corresponding feature names. Fit a logistic regression model with an intercept, rank features by the absolute value of their fitted coefficients, exclude the intercept, and return the names of the top 3 features in descending order of importance. If two features have the same absolute coefficient, break ties by feature name in ascending alphabetical order.

Constraints

  • 3 <= number of features <= 20
  • 1 <= number of samples <= 500
  • len(feature_names) == number of features
  • Each row in X has length equal to len(y)
  • y contains only 0 and 1
  • y contains at least one 0 and at least one 1

Examples

Input: ([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]], [0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1], ['A', 'B', 'C', 'D'])

Expected Output: ['A', 'B', 'C']

Explanation: The grouped synthetic data is built so feature A has the strongest positive effect, followed by B, then C.

Input: ([[0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]], [0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1], ['beta', 'alpha', 'gamma', 'delta'])

Expected Output: ['alpha', 'beta', 'gamma']

Explanation: Alpha and beta are tied in strength, so the tie is broken alphabetically by feature name.

Input: ([[0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]], [0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1], ['x1', 'x2', 'x3'])

Expected Output: ['x1', 'x2', 'x3']

Explanation: Edge case: there are exactly 3 features, so all of them are returned after ranking.

Solution

def solution(X, y, feature_names):
    import math

    num_features = len(X)
    num_samples = len(y)

    def solve_linear(A, b):
        n = len(b)
        M = [A[i][:] + [b[i]] for i in range(n)]

        for col in range(n):
            pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
            M[col], M[pivot] = M[pivot], M[col]

            if abs(M[col][col]) < 1e-12:
                M[col][col] = 1e-12

            pivot_value = M[col][col]
            for j in range(col, n + 1):
                M[col][j] /= pivot_value

            for r in range(n):
                if r == col:
                    continue
                factor = M[r][col]
                if factor == 0:
                    continue
                for j in range(col, n + 1):
                    M[r][j] -= factor * M[col][j]

        return [M[i][n] for i in range(n)]

    def sigmoid(z):
        if z >= 0:
            e = math.exp(-z)
            return 1.0 / (1.0 + e)
        e = math.exp(z)
        return e / (1.0 + e)

    means = []
    stds = []
    for i in range(num_features):
        col = [float(v) for v in X[i]]
        mean = sum(col) / num_samples
        var = sum((v - mean) ** 2 for v in col) / num_samples
        std = math.sqrt(var)
        if std == 0:
            std = 1.0
        means.append(mean)
        stds.append(std)

    Z = []
    for j in range(num_samples):
        row = [1.0]
        for i in range(num_features):
            row.append((float(X[i][j]) - means[i]) / stds[i])
        Z.append(row)

    beta = [0.0] * (num_features + 1)
    lam = 1e-6

    for _ in range(40):
        probs = []
        for row in Z:
            z = sum(beta[k] * row[k] for k in range(num_features + 1))
            probs.append(sigmoid(z))

        grad = [0.0] * (num_features + 1)
        H = [[0.0] * (num_features + 1) for _ in range(num_features + 1)]

        for idx, row in enumerate(Z):
            p = probs[idx]
            diff = p - float(y[idx])
            w = p * (1.0 - p)

            for a in range(num_features + 1):
                grad[a] += diff * row[a]

            for a in range(num_features + 1):
                ra = row[a]
                for b in range(a, num_features + 1):
                    H[a][b] += w * ra * row[b]

        for a in range(num_features + 1):
            for b in range(a):
                H[a][b] = H[b][a]

        for a in range(1, num_features + 1):
            grad[a] += lam * beta[a]
            H[a][a] += lam

        delta = solve_linear(H, grad)

        max_change = 0.0
        for i in range(num_features + 1):
            beta[i] -= delta[i]
            max_change = max(max_change, abs(delta[i]))

        if max_change < 1e-8:
            break

    coeffs = []
    for i in range(num_features):
        coeffs.append(beta[i + 1] / stds[i])

    ranked = sorted(
        range(num_features),
        key=lambda i: (-round(abs(coeffs[i]), 12), feature_names[i])
    )

    return [feature_names[i] for i in ranked[:3]]

Time complexity: O(T * (n * d^2 + d^3)), where n is the number of samples, d is the number of features + 1, and T is the number of Newton iterations. Space complexity: O(n * d + d^2).

Hints

  1. Because X is feature-major, you may want to conceptually transpose it so each training example becomes one row.
  2. A simple way to fit logistic regression without external libraries is Newton's method (IRLS) or gradient descent with an intercept term.
Last updated: May 17, 2026

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Find Windows Containing a Target - Roblox (medium)
  • Implement Sliding-Window Rate Limiter - Roblox (medium)
  • Find target-heavy sliding windows - Roblox (medium)
  • Find most frequent call path in logs - Roblox (medium)
  • Track Highest-Earning Experience - Roblox (medium)