
Interpret and validate regression with interactions

Last updated: Mar 29, 2026

Quick Overview

This question evaluates understanding of regression modeling with interaction terms, binary outcome models and odds/marginal-effect interpretation, causal inference concepts like endogeneity and instrumental variables, model diagnostics and standard-error choices, and count-data modeling.



Company: TikTok

Role: Data Scientist

Category: Statistics & Math

Difficulty: hard

Interview Round: Technical Screen



Oct 13, 2025, 9:49 PM

Modeling 7-day Retention with LPM and Logistic Regression

Context

You have user-level data with a binary outcome retained_7d (1 if the user is active 7 days after signup; 0 otherwise). Covariates include:

  • treated: 1 if the user was exposed to a new preloading feature; 0 otherwise
  • watch_time_day1: minutes watched on day 1 (continuous)
  • new_user: 1 if the user belongs to the new-user segment; 0 otherwise
  • Fixed effects (FE) for country and signup_date

The model includes an interaction term treated × new_user.
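
As a notational sketch, the two specifications can be written as below. The symbols β, γ, α, λ and the index functions c(i) (country of user i) and t(i) (signup date of user i) are assumptions introduced here for illustration; they are not given in the question.

```latex
% Linear probability model (LPM): coefficients are changes in probability
\Pr(\text{retained\_7d}_i = 1 \mid X_i)
  = \beta_0 + \beta_1\,\text{treated}_i + \beta_2\,\text{new\_user}_i
  + \beta_3\,(\text{treated}_i \times \text{new\_user}_i)
  + \gamma\,\text{watch\_time\_day1}_i + \alpha_{c(i)} + \lambda_{t(i)}

% Logistic regression: same linear index, but on the log-odds scale
\log\frac{p_i}{1 - p_i}
  = \beta_0 + \beta_1\,\text{treated}_i + \beta_2\,\text{new\_user}_i
  + \beta_3\,(\text{treated}_i \times \text{new\_user}_i)
  + \gamma\,\text{watch\_time\_day1}_i + \alpha_{c(i)} + \lambda_{t(i)},
\qquad p_i = \Pr(\text{retained\_7d}_i = 1 \mid X_i)
```

Under this notation, β₁ is the treatment effect for users with new_user = 0, and β₁ + β₃ is the treatment effect for users with new_user = 1, so β₃ is the difference between the two. In the LPM these are differences in retention probability; in the logit they are differences in log-odds.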

Answer the following:

(a) Model specification and interpretation

  • Write the Linear Probability Model (LPM) and the logistic regression specification (with fixed effects and the interaction).
  • Interpret the coefficients on treated and on treated × new_user for both models.
  • For the logistic model, translate a coefficient into an odds ratio and then into an approximate marginal effect at the mean.
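
The odds-ratio and marginal-effect translation asked for above can be sketched in a few lines. The coefficient values and the mean predicted probability `p_bar` are hypothetical numbers chosen for illustration, not estimates from any data.

```python
import math

def odds_ratio(beta):
    """Exponentiate a logit coefficient to get an odds ratio."""
    return math.exp(beta)

def marginal_effect_at_mean(beta, p_bar):
    """First-order approximation: d p / d x = beta * p * (1 - p),
    evaluated at the predicted probability for average covariates."""
    return beta * p_bar * (1.0 - p_bar)

def exact_probability_change(beta, p_bar):
    """Exact change in probability when the log-odds shift by beta,
    starting from a baseline probability p_bar."""
    base_log_odds = math.log(p_bar / (1.0 - p_bar))
    return 1.0 / (1.0 + math.exp(-(base_log_odds + beta))) - p_bar

# Hypothetical values (assumptions for illustration only).
beta_treated = 0.10      # logit coefficient on treated (new_user = 0)
beta_interaction = 0.05  # logit coefficient on treated x new_user
p_bar = 0.40             # predicted retention at mean covariates

or_existing = odds_ratio(beta_treated)
or_new = odds_ratio(beta_treated + beta_interaction)
me_existing = marginal_effect_at_mean(beta_treated, p_bar)
exact_existing = exact_probability_change(beta_treated, p_bar)

print(or_existing, or_new, me_existing, exact_existing)
```

For a small coefficient the approximation and the exact change are close; for a binary regressor such as treated, the exact discrete change is usually the more defensible number to report.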

(b) Assumptions, diagnostics, and standard errors

Discuss assumptions and diagnostics you would check:

  • Heteroskedasticity
  • Separation (logit)
  • Multicollinearity (e.g., VIF)
  • Misspecification (e.g., link test, functional form)
  • Calibration (e.g., reliability plots)
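
For the calibration check, a reliability plot is built from a table of mean predicted probability versus observed retention rate per bin. The following is a minimal sketch with made-up predictions; the function name `reliability_table` and the equal-width binning are choices made here for illustration.

```python
def reliability_table(y_true, p_pred, n_bins=4):
    """Group predictions into equal-width probability bins and compare
    the mean prediction with the observed outcome rate in each bin.
    Returns rows of (bin_index, count, mean_predicted, observed_rate)."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pred):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    rows = []
    for idx, items in enumerate(bins):
        if not items:
            continue  # skip empty bins
        n = len(items)
        mean_pred = sum(p for _, p in items) / n
        obs_rate = sum(y for y, _ in items) / n
        rows.append((idx, n, mean_pred, obs_rate))
    return rows

# Made-up outcomes and predicted probabilities (illustration only).
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
p_pred = [0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75]

rows = reliability_table(y_true, p_pred)
for row in rows:
    print(row)
```

A well-calibrated model has mean predicted probability close to the observed rate in every bin; systematic gaps suggest a misspecified link or functional form.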

What standard errors would you report and why (HC-robust vs clustered)? Justify your cluster choice.
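
To see why the choice between HC-robust and clustered standard errors matters, the sketch below computes both for the simplest possible model, an intercept-only regression (the sample mean), without small-sample corrections. The data and the cluster labels are made up; which variable to cluster on in practice (for example country or signup cohort) is exactly the judgment the question asks about.

```python
import math
from collections import defaultdict

def mean_with_ses(y, cluster_ids):
    """Sample mean with two sandwich standard errors:
    HC0 (independent observations) and cluster-robust (CR0)."""
    n = len(y)
    mean = sum(y) / n
    resid = [v - mean for v in y]
    # HC0: the "meat" is the sum of squared residuals.
    var_hc0 = sum(e * e for e in resid) / n ** 2
    # Cluster-robust: sum residuals within each cluster first, then square.
    cluster_sums = defaultdict(float)
    for e, g in zip(resid, cluster_ids):
        cluster_sums[g] += e
    var_cluster = sum(s * s for s in cluster_sums.values()) / n ** 2
    return mean, math.sqrt(var_hc0), math.sqrt(var_cluster)

# Made-up data: outcomes are perfectly correlated within each cluster.
y = [1] * 4 + [0] * 4 + [1] * 4 + [0] * 4
clusters = ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 4

mean, se_hc0, se_cluster = mean_with_ses(y, clusters)
print(mean, se_hc0, se_cluster)
```

With strong within-cluster correlation, the clustered standard error is noticeably larger than the HC0 one, because the effective number of independent observations is closer to the number of clusters than to the number of users.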

(c) Endogeneity of watch_time_day1

Suppose watch_time_day1 is endogenous (e.g., driven by unobserved preference). Propose two remedies and their assumptions:

  • Control function/IV approach
  • Panel fixed effects using within-user variation

What instruments or proxies could be plausible here?
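
The intuition behind the IV remedy can be sketched with a linear toy example. Everything here is an assumption for illustration: `z` stands for a hypothetical binary instrument that shifts watch time but affects the outcome only through watch time, `u` for the unobserved preference, and the outcome is continuous rather than the binary retained_7d, so this is a simplified linear analogue and not a full control-function logit.

```python
def cov(a, b):
    """Population covariance (divides by n)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

TRUE_EFFECT = 0.1  # assumed causal effect of watch time on the outcome

# Toy data: z is uncorrelated with u by construction.
z = [0, 0, 1, 1]                 # hypothetical instrument
u = [-1, 1, -1, 1]               # unobserved preference (confounder)
watch_time = [zi + ui for zi, ui in zip(z, u)]   # driven by z and u
outcome = [TRUE_EFFECT * x + ui for x, ui in zip(watch_time, u)]

# Naive OLS slope is biased because watch_time is correlated with u.
ols_slope = cov(watch_time, outcome) / cov(watch_time, watch_time)

# IV (Wald) estimator uses only the variation in watch_time induced by z.
iv_slope = cov(z, outcome) / cov(z, watch_time)

print(ols_slope, iv_slope)
```

The IV estimate recovers the assumed effect only because the toy instrument satisfies relevance (it moves watch time) and exclusion (it is unrelated to the confounder); with a real instrument both conditions have to be argued, and relevance can be checked with a first-stage regression.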

(d) Count outcomes for daily videos watched

You also have count data for daily videos watched. When would Poisson or negative binomial be preferred over OLS? How would you check overdispersion and interpret the exponentiated coefficients?
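
A quick first look at overdispersion, and the reading of an exponentiated coefficient, can be sketched as follows. The counts and the coefficient are made up, and the dispersion statistic shown is the Pearson statistic for an intercept-only Poisson model, which reduces to the sample variance divided by the mean; with covariates, the same ratio would be computed from the fitted model's Pearson residuals.

```python
import math

def pearson_dispersion(counts):
    """Pearson chi-square / (n - 1) for an intercept-only Poisson model.
    Equals sample variance / mean; values well above 1 suggest
    overdispersion, which favors negative binomial over Poisson."""
    n = len(counts)
    mu = sum(counts) / n
    pearson_chi2 = sum((y - mu) ** 2 / mu for y in counts)
    return pearson_chi2 / (n - 1)

def rate_ratio(beta):
    """exp(beta) is the multiplicative change in the expected count
    for a one-unit increase in the regressor."""
    return math.exp(beta)

def percent_change(beta):
    """Percent change in the expected count implied by beta."""
    return (math.exp(beta) - 1.0) * 100.0

# Made-up daily video counts (illustration only).
counts = [0, 1, 2, 3, 14]
dispersion = pearson_dispersion(counts)

# Hypothetical Poisson/NB coefficient on treated.
beta_treated = math.log(1.2)
print(dispersion, rate_ratio(beta_treated), percent_change(beta_treated))
```

Under a Poisson model the variance equals the mean, so a dispersion ratio near 1 is consistent with Poisson, while a ratio far above 1 means Poisson standard errors would be too small unless they are made robust or the model is switched to negative binomial.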
