You repeatedly choose between two casino games (arms), G1 and G2, over rounds t = 1, …, N. Each play of arm i yields a Bernoulli reward with unknown success probability p_i. Assume rewards are i.i.d. within each arm and independent across arms. Your tasks are:
(a) Propose and justify an exploration–exploitation strategy (e.g., UCB1 or Thompson sampling) and give its update rules (a minimal Thompson-sampling sketch follows this list).
(b) Specify a stopping/commitment rule and derive the sample complexity needed to identify the better arm with error probability at most 5% whenever |p_1 − p_2| ≥ δ (a worked Hoeffding-bound calculation follows this list).
(c) Discuss how prior information, unequal play costs, or nonstationarity would change your design (a discounted-posterior sketch for the nonstationary case follows this list).
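As a reference point for part (a), here is a minimal sketch of Thompson sampling for Bernoulli arms with Beta(1, 1) priors; the class name, the horizon of 10,000 rounds, and the simulated probabilities are illustrative assumptions, not part of the problem. (For UCB1, the analogous update would track each arm's empirical mean and play the arm maximizing \hat{p}_i + sqrt(2 ln t / n_i).)

```python
import random


class BernoulliThompson:
    """Thompson sampling for Bernoulli arms with Beta(1, 1) priors (illustrative sketch)."""

    def __init__(self, n_arms=2):
        # alpha = successes + 1, beta = failures + 1, per arm
        self.alpha = [1.0] * n_arms
        self.beta = [1.0] * n_arms

    def select_arm(self):
        # Draw one posterior sample per arm and play the argmax
        samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        # Conjugate Beta update for a Bernoulli reward in {0, 1}
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward


if __name__ == "__main__":
    p = [0.55, 0.60]  # assumed true success probabilities, unknown to the player
    agent = BernoulliThompson(n_arms=2)
    for t in range(10_000):
        arm = agent.select_arm()
        reward = 1 if random.random() < p[arm] else 0
        agent.update(arm, reward)
    print("posterior means:",
          [a / (a + b) for a, b in zip(agent.alpha, agent.beta)])
```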
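For part (b), one standard route is a fixed-budget explore-then-commit rule analyzed with Hoeffding's inequality; the even error split between arms and the δ/2 deviation threshold are choices made for this sketch, not the unique answer.

```latex
% Explore-then-commit: play each arm n times, then commit to the arm with the
% higher empirical mean. If |p_1 - p_2| >= \delta, a wrong commitment requires
% some arm's empirical mean to deviate from its true mean by at least \delta/2.
\[
  \Pr\bigl[\lvert \hat{p}_i - p_i \rvert \ge \tfrac{\delta}{2}\bigr]
  \;\le\; 2\exp\!\bigl(-2n(\delta/2)^2\bigr)
  \;=\; 2\exp\!\bigl(-n\delta^2/2\bigr)
  \qquad \text{(Hoeffding)}
\]
% Union bound over the two arms and require total error at most 5%:
\[
  4\exp\!\bigl(-n\delta^2/2\bigr) \le 0.05
  \quad\Longrightarrow\quad
  n \;\ge\; \frac{2}{\delta^2}\,\ln\frac{4}{0.05}
  \;\approx\; \frac{8.8}{\delta^2}\ \text{plays per arm.}
\]
```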
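For the nonstationarity point in part (c), one common modification is to discount old evidence so the posterior can track drifting p_i; the sketch below shows a discounted Beta update (the discount factor gamma = 0.99 is an assumed illustrative parameter), which keeps the effective sample size near 1 / (1 − gamma) recent plays.

```python
def discounted_update(alpha, beta, arm, reward, gamma=0.99):
    """Discounted Beta update for nonstationary Bernoulli bandits (illustrative sketch).

    Pseudo-counts decay toward the Beta(1, 1) prior each round, so old
    observations are gradually forgotten and the agent keeps exploring.
    """
    for i in range(len(alpha)):
        alpha[i] = 1.0 + gamma * (alpha[i] - 1.0)
        beta[i] = 1.0 + gamma * (beta[i] - 1.0)
    # Then apply the usual conjugate update for the arm just played
    alpha[arm] += reward
    beta[arm] += 1 - reward
    return alpha, beta
```

Informative priors would simply replace the Beta(1, 1) initialization above, and unequal play costs would shift the objective from reward maximization to cost-adjusted reward, changing which arm the index or posterior comparison should favor.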