Design: Contextual Bandit Recommendation with Online Learning
You are designing an online learning recommendation system. At each user interaction:
- You receive exactly 4 candidate items from an upstream candidate generator.
- You must choose exactly 1 item to show the user.
- You receive immediate feedback (e.g., a click or dwell time).
- The model must update online so that future selections improve over time.
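The interaction loop above can be sketched as a Thompson-sampling contextual bandit with a shared Bayesian linear reward model. This is a minimal illustration, not a prescribed solution: the feature dimension `d`, the ridge prior `lam`, and the binary click reward are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8      # assumed joint user/item/context feature dimension
lam = 1.0  # ridge prior precision

# Shared linear model: posterior over reward weights w ~ N(A^{-1} b, A^{-1})
A = lam * np.eye(d)   # precision matrix
b = np.zeros(d)       # running sum of reward-weighted features

def select(candidate_features):
    """Thompson sampling: draw one weight sample, show the best-scoring candidate."""
    mu = np.linalg.solve(A, b)
    w = rng.multivariate_normal(mu, np.linalg.inv(A))
    scores = candidate_features @ w
    return int(np.argmax(scores))

def update(x, reward):
    """Online Bayesian update with the observed (features, reward) pair."""
    global A, b
    A += np.outer(x, x)
    b += reward * x

# One interaction: 4 candidates arrive, one is shown, feedback updates the model.
candidates = rng.normal(size=(4, d))
chosen = select(candidates)
update(candidates[chosen], reward=1.0)  # e.g., the user clicked
```

Sampling from the posterior (rather than scoring with the mean) is what provides exploration here; plausible-but-uncertain candidates occasionally win the argmax.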
Provide a design that covers:
- Model choice (with justification) for a contextual bandit setup.
- Feature engineering for users, items, and context, including handling cold start.
- Feedback handling and reward definition, including delayed/implicit signals and logging for learning.
- Exploration–exploitation strategy and the selection algorithm.
- Offline evaluation methodology and online experimentation/monitoring.
State any minimal assumptions you need (e.g., feedback semantics, latency constraints), and make your design robust to non-stationarity and scale.
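For the offline-evaluation requirement, one standard approach on logged bandit data is inverse propensity scoring (IPS). A minimal sketch, assuming each log record carries the propensity the logging policy assigned to the shown item; the log schema and `target_policy` interface here are hypothetical:

```python
import numpy as np

def ips_value(logs, target_policy):
    """IPS estimate of a candidate policy's average reward from logged data.

    logs: iterable of (candidate_features, shown_arm, reward, logging_propensity),
          an assumed schema for what the production logger records.
    target_policy: maps a (4, d) candidate feature matrix to a probability
          distribution over the 4 arms.
    """
    total = 0.0
    n = 0
    for X, a, r, p_log in logs:
        p_new = target_policy(X)[a]
        total += r * p_new / p_log  # reweight logged reward by the propensity ratio
        n += 1
    return total / n

# Sanity check: when the target policy equals the logging policy (uniform here),
# the IPS estimate reduces to the plain average of logged rewards.
rng = np.random.default_rng(1)
logs = [(rng.normal(size=(4, 3)), int(rng.integers(4)),
         float(rng.integers(2)), 0.25) for _ in range(1000)]
uniform = lambda X: np.full(len(X), 0.25)
est = ips_value(logs, uniform)
```

Logging propensities at serving time is what makes this estimator (and lower-variance variants such as doubly robust estimation) possible later.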