How do I approach Statistics & Math interview questions?

Statistics & Math questions reward a firm grasp of core concepts plus regular practice. PracHub provides worked solutions with explanations to help you prepare for Statistics & Math interviews.

What difficulty level is this interview question?

This is a hard difficulty Statistics & Math question, commonly asked during Technical Screen rounds at TikTok.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at TikTok during technical interviews.

Interpret and validate regression with interactions

Quick Overview

This question evaluates understanding of regression modeling with interaction terms, binary outcome models and odds/marginal-effect interpretation, causal inference concepts like endogeneity and instrumental variables, model diagnostics and standard-error choices, and count-data modeling.

Modeling 7-day Retention with LPM and Logistic Regression

Context

You have user-level data with a binary outcome retained_7d (1 if the user is active 7 days after signup; 0 otherwise). Covariates include:

treated: 1 if the user was exposed to a new preloading feature; 0 otherwise
watch_time_day1: minutes watched on day 1 (continuous)
new_user: 1 if the user belongs to the new-user segment; 0 otherwise
Fixed effects (FE) for country and signup_date

The model includes an interaction term treated × new_user.

Answer the following:

(a) Model specification and interpretation

Write the Linear Probability Model (LPM) and the logistic regression specification (with fixed effects and the interaction).
Interpret the coefficients on treated and on treated × new_user for both models.
For the logistic model, translate a coefficient into an odds ratio and then into an approximate marginal effect at the mean.
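
One way to write the two specifications is sketched below. The symbols are notation assumed here, not given in the question: \beta_k for the LPM, \theta_k for the logit, \alpha and \gamma (a and g in the logit) for the country and signup-date fixed effects, and c(i), t(i) for user i's country and signup date. The numbers in the last block are hypothetical and only illustrate the arithmetic.

```latex
% Linear Probability Model (OLS on the binary outcome)
\[
\text{retained\_7d}_i = \beta_1\,\text{treated}_i + \beta_2\,\text{new\_user}_i
  + \beta_3\,(\text{treated}_i \times \text{new\_user}_i)
  + \beta_4\,\text{watch\_time\_day1}_i + \alpha_{c(i)} + \gamma_{t(i)} + \varepsilon_i
\]

% Logistic regression: same index on the log-odds scale,
% with p_i = \Pr(\text{retained\_7d}_i = 1 \mid \text{covariates, FE})
\[
\log\frac{p_i}{1 - p_i} = \theta_1\,\text{treated}_i + \theta_2\,\text{new\_user}_i
  + \theta_3\,(\text{treated}_i \times \text{new\_user}_i)
  + \theta_4\,\text{watch\_time\_day1}_i + a_{c(i)} + g_{t(i)}
\]

% Treatment effect by segment
\[
\text{LPM: } \beta_1 \ (\text{new\_user}=0), \qquad \beta_1 + \beta_3 \ (\text{new\_user}=1)
\]
\[
\text{Logit odds ratios: } e^{\theta_1} \ (\text{new\_user}=0), \qquad
e^{\theta_1 + \theta_3} \ (\text{new\_user}=1), \qquad
e^{\theta_3} = \text{ratio of the two odds ratios}
\]

% Odds ratio to approximate marginal effect at the mean
\[
\frac{\partial p}{\partial x_k} = \theta_k\, p\,(1 - p)
\quad\Rightarrow\quad
\text{ME at the mean} \approx \theta_k\, \bar{p}\,(1 - \bar{p})
\]

% Hypothetical numbers: \theta_1 = 0.20, \bar{p} = 0.40
\[
e^{0.20} \approx 1.22, \qquad 0.20 \times 0.40 \times 0.60 = 0.048
\]
```

Under these assumed numbers, treatment multiplies the odds of retention by about 1.22 for users with new_user = 0, which is roughly a 4.8 percentage-point change near the mean. Because treated is binary, the exact discrete change p(treated = 1) - p(treated = 0) is usually preferable to the derivative approximation, and on the probability scale the interaction effect in a logit is not \theta_3 alone.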

(b) Assumptions, diagnostics, and standard errors

Discuss assumptions and diagnostics you would check:

Heteroskedasticity
Separation (logit)
Multicollinearity (e.g., VIF)
Misspecification (e.g., link test, functional form)
Calibration (e.g., reliability plots)
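
As an illustration of the multicollinearity check in the list above, here is a minimal sketch of variance inflation factors computed with NumPy. The data are simulated and the helper name `vif` is hypothetical; the point is that an interaction term is mechanically correlated with its components, so a raised VIF on treated × new_user is expected and not by itself a defect.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n x k, no intercept column).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on an intercept plus all the other columns.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Simulated covariates (assumed distributions, for illustration only)
rng = np.random.default_rng(0)
n = 5000
treated = rng.integers(0, 2, n)
new_user = rng.integers(0, 2, n)
watch_time_day1 = rng.gamma(2.0, 10.0, n)

X = np.column_stack([treated, new_user, treated * new_user, watch_time_day1])
vifs = vif(X)
for name, value in zip(["treated", "new_user", "treated_x_new_user", "watch_time_day1"], vifs):
    print(f"{name}: VIF = {value:.2f}")
```

In this simulated design the interaction column has a noticeably higher VIF than watch_time_day1, which is generated independently of the other columns.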

What standard errors would you report and why (HC-robust vs clustered)? Justify your cluster choice.
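
To make the HC-robust versus clustered comparison concrete, the sketch below computes both sandwich estimators for an LPM by hand in NumPy. It assumes, hypothetically, that exposure was rolled out at the level of a cell (for example a country × signup_date cell) and that users in the same cell share an unobserved shock; the function name `ols_sandwich` and all simulation parameters are illustrative, not part of the question.

```python
import numpy as np

def ols_sandwich(X, y, clusters=None):
    """OLS coefficients with HC1 or one-way cluster-robust standard errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    u = y - X @ beta
    if clusters is None:
        # HC1 "meat": sum_i u_i^2 x_i x_i'
        meat = (X * u[:, None] ** 2).T @ X
        scale = n / (n - k)
    else:
        # Cluster "meat": sum_g (sum_{i in g} x_i u_i)(sum_{i in g} x_i u_i)'
        ids, inv = np.unique(np.asarray(clusters), return_inverse=True)
        G = len(ids)
        scores = np.zeros((G, k))
        np.add.at(scores, inv, X * u[:, None])
        meat = scores.T @ scores
        scale = (G / (G - 1)) * ((n - 1) / (n - k))
    cov = scale * XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(cov))

# Simulated data: treatment and a common shock both vary at the cell level (assumption)
rng = np.random.default_rng(0)
G, m = 100, 200                       # cells, users per cell
cell = np.repeat(np.arange(G), m)
treated_cell = rng.integers(0, 2, G)
shock = rng.normal(0.0, 0.1, G)
p = np.clip(0.4 + 0.05 * treated_cell[cell] + shock[cell], 0.0, 1.0)
y = (rng.random(G * m) < p).astype(float)
X = np.column_stack([np.ones(G * m), treated_cell[cell]])

beta, se_hc = ols_sandwich(X, y)
_, se_cl = ols_sandwich(X, y, clusters=cell)
print("treated coef:", beta[1], "HC1 SE:", se_hc[1], "clustered SE:", se_cl[1])
```

When assignment and shocks operate at the cell level, the HC1 standard error treats every user as independent and understates uncertainty, which is the usual argument for clustering at the level of treatment assignment.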

(c) Endogeneity of watch_time_day1

Suppose watch_time_day1 is endogenous (e.g., driven by unobserved preference). Propose two remedies and their assumptions:

Control function/IV approach
Panel fixed effects using within-user variation

What instruments or proxies could be plausible here?
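
The IV and control-function remedies can be sketched on simulated data as follows. The instrument `z` is hypothetical (think of a randomized day-1 nudge that shifts watch time but has no direct effect on retention), the unobserved `preference` plays the role of the confounder, and a linear probability model is used so that two-stage least squares and the control function coincide. All coefficients are made up for the simulation.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(0)
n = 50_000
true_effect = 0.004                     # assumed effect of one extra minute on P(retained)

preference = rng.normal(0.0, 1.0, n)    # unobserved confounder
z = rng.integers(0, 2, n).astype(float) # hypothetical randomized instrument
watch = 30 + 10 * z + 5 * preference + rng.normal(0.0, 3.0, n)
p = np.clip(0.3 + true_effect * watch + 0.08 * preference, 0.0, 1.0)
y = (rng.random(n) < p).astype(float)

const = np.ones(n)

# Naive OLS: biased because preference drives both watch time and retention
b_ols = ols(np.column_stack([const, watch]), y)[1]

# First stage: project the endogenous regressor on the instrument
Z = np.column_stack([const, z])
watch_hat = Z @ ols(Z, watch)
v_hat = watch - watch_hat               # first-stage residual

# 2SLS: replace watch with its fitted value
b_2sls = ols(np.column_stack([const, watch_hat]), y)[1]

# Control function: keep watch and add the first-stage residual as a regressor
b_cf = ols(np.column_stack([const, watch, v_hat]), y)[1]

print("OLS:", b_ols, "2SLS:", b_2sls, "control function:", b_cf)
```

In a linear second stage the two point estimates are algebraically identical; the control function becomes a distinct tool in the logit case, where the residual is added to the nonlinear model and a significant coefficient on it is evidence of endogeneity. Second-stage standard errors from this manual procedure are not valid as printed and would need the usual 2SLS correction or a bootstrap.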

(d) Count outcomes for daily videos watched

You also have count data for daily videos watched. When would Poisson or negative binomial be preferred over OLS? How would you check overdispersion and interpret the exponentiated coefficients?
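
A minimal sketch of the overdispersion check and the rate-ratio reading of exponentiated coefficients is given below. It fits a Poisson regression by Newton's method in NumPy on simulated counts that are deliberately overdispersed (a gamma-Poisson mixture, which is the negative binomial data-generating process). The helper `fit_poisson` and every parameter value are assumptions for illustration.

```python
import numpy as np

def fit_poisson(X, y, n_iter=25):
    """Poisson regression with log link via Newton's method.

    Assumes column 0 of X is the intercept.
    """
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        grad = X.T @ (y - mu)
        hess = X.T @ (X * mu[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Simulated daily video counts with unobserved heterogeneity (assumed parameters)
rng = np.random.default_rng(0)
n = 20_000
treated = rng.integers(0, 2, n).astype(float)
mu_true = np.exp(np.log(5.0) + 0.2 * treated)
heterogeneity = rng.gamma(shape=2.0, scale=0.5, size=n)   # mean 1
videos = rng.poisson(mu_true * heterogeneity).astype(float)

X = np.column_stack([np.ones(n), treated])
beta = fit_poisson(X, videos)
mu_hat = np.exp(X @ beta)

# Pearson dispersion: about 1 under Poisson, well above 1 signals overdispersion
dispersion = np.sum((videos - mu_hat) ** 2 / mu_hat) / (n - X.shape[1])

# Exponentiated coefficient = incidence rate ratio (multiplicative change in expected count)
irr_treated = np.exp(beta[1])

print("Pearson dispersion:", dispersion)
print("IRR for treated:", irr_treated)
```

A dispersion statistic well above 1, as here, suggests moving to a negative binomial model or at least reporting robust (quasi-Poisson) standard errors. The incidence rate ratio reads multiplicatively: a value of 1.2 would mean treated users watch about 20% more videos per day in expectation, holding the other covariates fixed.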
