Online bandit with 3 variants, churn guardrail, and delayed conversions
Context
You are running an online experiment with 3 variants (including control). The primary objective is to maximize conversions. There is a hard guardrail on churn: any increase in churn above a specified tolerance must trigger mitigation. Conversions are delayed relative to exposure.
Tasks
(a) Design a Thompson Sampling bandit:
- Specify likelihoods and priors for both the primary metric and the churn guardrail.
- Explain how you will handle delayed feedback (optimistic versus debiased estimators) and non-stationarity (e.g., discounting or sliding windows).
(b) Set traffic floors and fairness constraints across variants and key strata.
(c) Define stopping and rollback policies when the churn guardrail is breached.
(d) Compare expected regret and business impact against a fixed-horizon A/B test under seasonality.
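As a rough illustration of tasks (a) and (b), the sketch below assumes a Bernoulli likelihood with a Beta prior per variant, exponential discounting toward the prior for non-stationarity, and a uniform traffic floor mixed into the Thompson allocation. The function names, the discount factor gamma, the Beta(1, 1) prior, and the floor value are all hypothetical choices rather than part of the problem statement. Delayed feedback is handled here in the simplest way, by updating only on exposures whose conversion window has closed; other debiasing schemes are possible.

```python
import random


def discounted_update(alpha, beta, conversions, non_conversions,
                      gamma=0.99, prior=(1.0, 1.0)):
    """Shrink accumulated evidence toward the prior, then add new counts.

    Only matured exposures (conversion window closed) should be passed in,
    so that pending conversions are not miscounted as failures.
    """
    a0, b0 = prior
    new_alpha = a0 + gamma * (alpha - a0) + conversions
    new_beta = b0 + gamma * (beta - b0) + non_conversions
    return new_alpha, new_beta


def allocate(posteriors, floor=0.1, n_draws=5000, seed=0):
    """Traffic shares: Thompson 'probability of being best' plus a floor.

    posteriors: list of (alpha, beta) pairs, one per variant.
    floor: minimum traffic share guaranteed to every variant (assumed value).
    """
    k = len(posteriors)
    if k * floor > 1.0:
        raise ValueError("floor is too large for the number of variants")
    rng = random.Random(seed)
    wins = [0] * k
    for _ in range(n_draws):
        draws = [rng.betavariate(a, b) for a, b in posteriors]
        wins[draws.index(max(draws))] += 1
    # Mix the Thompson shares with a uniform floor; the result sums to 1.
    return [floor + (1.0 - k * floor) * w / n_draws for w in wins]


# Hypothetical posteriors for control and two treatments.
shares = allocate([(50, 950), (60, 940), (40, 960)], floor=0.1)
```

The same allocation could be computed separately per stratum if fairness constraints require a floor within each key segment, at the cost of slower learning in small strata.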
Assumptions to make explicit
- 3 variants include control.
- Primary outcome is binary conversion within a defined window; churn is a binary event within a defined window; both can arrive with delay.
- Guardrail tolerance is an absolute churn increase threshold g_max (could be set to 0 for no increase allowed).
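One way to make the guardrail in task (c) operational, given these assumptions, is to estimate the posterior probability that a variant's churn exceeds control's churn by more than g_max and to act on that probability. The sketch below assumes Beta posteriors on churn for each variant; the function names and the 0.80 and 0.95 action thresholds are hypothetical and would need to be set by the business.

```python
import random


def prob_churn_breach(control, variant, g_max=0.0, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(churn_variant - churn_control > g_max).

    control, variant: (alpha, beta) parameters of Beta posteriors on churn.
    """
    rng = random.Random(seed)
    (a_c, b_c), (a_v, b_v) = control, variant
    hits = 0
    for _ in range(n_draws):
        diff = rng.betavariate(a_v, b_v) - rng.betavariate(a_c, b_c)
        if diff > g_max:
            hits += 1
    return hits / n_draws


def guardrail_action(p_breach, rollback_at=0.95, throttle_at=0.80):
    """Map breach probability to an action (thresholds are assumed values)."""
    if p_breach >= rollback_at:
        return "rollback"
    if p_breach >= throttle_at:
        return "throttle"
    return "continue"


# Hypothetical posteriors: control churn near 5%, variant churn near 10%.
p = prob_churn_breach((51, 951), (101, 901), g_max=0.01)
action = guardrail_action(p)
```

A two-level response (throttle to the traffic floor first, roll back only on stronger evidence) is one design choice among several; a stricter policy would roll back as soon as the lower threshold is crossed.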