This question evaluates understanding of multi-armed bandit principles and contextual bandits, covering the algorithmic trade-offs (regret, exploration–exploitation balance, and modeling assumptions) among epsilon-greedy, UCB, and Thompson sampling, along with operational concerns such as delayed or batched rewards, non-stationarity, offline policy evaluation, and production safety. It is commonly asked in Analytics & Experimentation and machine learning interviews because it probes both conceptual grounding and practical judgment in online decision-making: candidates must reason about algorithm selection, performance trade-offs, and deployment considerations.
You are designing an online decision-making system for a large-scale product (e.g., recommendations, pricing, or notifications), where you must learn from user interactions while maximizing outcomes.
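To ground the discussion, here is a minimal, self-contained sketch of the three algorithms named above on a simulated Bernoulli (click/no-click) environment. The arm count, epsilon value, and click-through rates are illustrative assumptions, not part of the question.

```python
import math
import random

class EpsilonGreedy:
    """Pick a random arm with probability epsilon, else the best empirical mean."""
    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))
        return max(range(len(self.counts)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

class UCB1:
    """Optimism in the face of uncertainty: empirical mean plus a confidence bonus."""
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms
        self.t = 0

    def select(self):
        self.t += 1
        for a, c in enumerate(self.counts):
            if c == 0:  # play each arm once to initialize
                return a
        return max(
            range(len(self.counts)),
            key=lambda a: self.values[a]
            + math.sqrt(2 * math.log(self.t) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

class ThompsonBernoulli:
    """Beta-Bernoulli Thompson sampling: sample each arm's posterior, play the argmax."""
    def __init__(self, n_arms):
        self.alpha = [1.0] * n_arms  # successes + 1 (uniform prior)
        self.beta = [1.0] * n_arms   # failures + 1

    def select(self):
        samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=lambda i: samples[i])

    def update(self, arm, reward):
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward

if __name__ == "__main__":
    random.seed(0)
    true_rates = [0.05, 0.03, 0.08]  # hypothetical click-through rates
    for agent in (EpsilonGreedy(3), UCB1(3), ThompsonBernoulli(3)):
        total = 0
        for _ in range(10_000):
            arm = agent.select()
            reward = 1 if random.random() < true_rates[arm] else 0
            agent.update(arm, reward)
            total += reward
        print(type(agent).__name__, "total reward:", total)
```

In simulations like this, UCB and Thompson sampling typically accumulate less regret than fixed-epsilon greedy, because their exploration shrinks as evidence accumulates rather than staying constant.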
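Since the question also touches offline policy evaluation, a common building block worth sketching is the inverse propensity scoring (IPS) estimator, which reweights logged rewards by the ratio of target-policy to logging-policy action probabilities. The example below is a hypothetical illustration: `ips_value`, the uniform logging policy, and the simulated rates are assumptions for the sketch, not a prescribed implementation.

```python
import random

def ips_value(logs, target_policy):
    """Estimate a new policy's average reward from logged bandit data.

    Each log entry is (context, action, reward, logging_prob), where
    logging_prob is the probability the logging policy chose that action.
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        target_prob = target_policy(context)[action]
        total += reward * target_prob / logging_prob  # importance weight
    return total / len(logs)

# Hypothetical example: uniform logging policy over 3 actions,
# evaluated against a target policy that always plays action 2.
random.seed(0)
true_rates = [0.05, 0.03, 0.08]
logs = []
for _ in range(50_000):
    a = random.randrange(3)  # uniform logging policy
    r = 1 if random.random() < true_rates[a] else 0
    logs.append((None, a, r, 1.0 / 3.0))

always_arm_2 = lambda ctx: [0.0, 0.0, 1.0]
print("IPS estimate:", round(ips_value(logs, always_arm_2), 4))  # approx 0.08
```

IPS is unbiased when the logging policy has nonzero probability on every action the target policy might take, but its variance grows with the importance weights, which is why variants such as clipped IPS and doubly robust estimators come up in this discussion.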