RL System Design: Per‑User Spending Limits
You are designing a reinforcement learning (RL) system to set per-user spending limits in a payments/risk context. The goal is to balance revenue and user experience against fraud and credit losses, subject to regulatory constraints.
Task
Define and justify the RL formulation and training/deployment approach:
Environment/MDP
- State representation: What customer, risk, and context features are included? How are they featurized and updated over time?
- Action space: How are spending limit decisions represented (e.g., absolute limit vs. incremental adjustments; discrete vs. continuous)? Include any action masks.
- Transition dynamics: What drives state evolution and partial observability? How does the policy influence future states and outcomes?
- Reward signal: Specify the components (e.g., profit, expected credit/fraud losses, user satisfaction/friction, regulatory penalties) and how you aggregate/discount them.
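For concreteness, the sketch below shows one possible shape for this formulation: a featurized per-user state, a discrete action space of limit multipliers with a mask that enforces hard caps, and a per-step reward that nets revenue against losses and user friction. Every feature name, step size, and cost coefficient here is an illustrative assumption, not a prescribed design.

```python
from dataclasses import dataclass

# Illustrative per-user state: recent behavior, risk scores, and context.
# Field names and scales are assumptions for this sketch, not a fixed schema.
@dataclass
class UserState:
    current_limit: float      # currently assigned spending limit
    avg_monthly_spend: float  # trailing average spend
    utilization: float        # spend / limit over the last cycle
    fraud_score: float        # model score in [0, 1]; higher = riskier
    credit_score: float       # bureau or internal score, normalized to [0, 1]
    days_on_book: int         # account tenure
    recent_disputes: int      # chargebacks/disputes in a recent window

# Discrete action space: multiplicative adjustments to the current limit.
ACTIONS = [0.5, 0.8, 1.0, 1.2, 1.5]  # assumed step sizes

def action_mask(state: UserState, hard_cap: float = 50_000.0) -> list[bool]:
    """Mask out actions that would violate hard constraints (illustrative rules)."""
    mask = []
    for mult in ACTIONS:
        ok = state.current_limit * mult <= hard_cap   # absolute ceiling
        if state.fraud_score > 0.9 and mult > 1.0:    # no increases at very high fraud risk
            ok = False
        mask.append(ok)
    return mask

def step_reward(spend: float, realized_loss: float, declined_txns: int,
                interchange_rate: float = 0.02, friction_cost: float = 0.5) -> float:
    """Per-step reward: revenue minus realized losses minus a friction penalty.
    Coefficients are placeholders; in practice they come from unit economics."""
    return interchange_rate * spend - realized_loss - friction_cost * declined_txns
```

In a sequential formulation, returns over such per-step rewards would be discounted over review cycles; because loss labels arrive with a lag, an expected-loss estimate is often substituted for the realized loss during training.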
Training approach
- Describe how to train from logged historical decisions: offline RL vs. contextual bandits. When would you pick each?
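As one concrete baseline under the contextual-bandit framing, the sketch below fits a per-action reward model on logged decisions (the "direct method"), reweighting rows by clipped inverse propensities to correct for logging-policy bias, and then acts greedily over allowed actions. The data layout and the ridge-regression choice are assumptions; a full offline-RL treatment (e.g., fitted Q iteration with a conservatism penalty) would replace the single-step model with a value function over episodes.

```python
import numpy as np

def fit_reward_models(X, actions, rewards, propensities, n_actions, l2=1.0):
    """Fit one ridge-regression reward model per action from logged data.

    X: (n, d) feature matrix; actions: (n,) logged action indices;
    rewards: (n,) observed rewards; propensities: (n,) logging-policy
    probabilities of the logged action. Rows are reweighted by clipped
    inverse propensities, a standard correction for logging-policy bias."""
    d = X.shape[1]
    models = []
    for a in range(n_actions):
        idx = actions == a
        Xa, ra = X[idx], rewards[idx]
        if len(ra) == 0:
            models.append(np.zeros(d))        # no logged data for this action
            continue
        w = 1.0 / np.clip(propensities[idx], 0.05, 1.0)
        theta = np.linalg.solve(Xa.T @ (Xa * w[:, None]) + l2 * np.eye(d),
                                Xa.T @ (w * ra))
        models.append(theta)
    return np.stack(models)                    # shape: (n_actions, d)

def greedy_policy(models, x, allowed):
    """Pick the allowed action with the highest predicted reward for context x."""
    scores = models @ x
    scores[~np.asarray(allowed)] = -np.inf
    return int(np.argmax(scores))
```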
Exploration under risk constraints
- Propose an exploration strategy that respects hard safety constraints while still learning.
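One way to keep exploring without violating hard constraints is to randomize only within a pre-vetted safe action set, e.g., modest steps around the current limit and never above the absolute cap, and exploit otherwise. The epsilon, step bound, and cap in this sketch are placeholder values.

```python
import random

def safe_explore(greedy_action: int, allowed: list[bool], current_limit: float,
                 actions: list[float], epsilon: float = 0.05,
                 max_step: float = 1.2, hard_cap: float = 50_000.0) -> int:
    """Epsilon-greedy exploration confined to a conservatively defined safe set."""
    safe_set = [
        i for i, mult in enumerate(actions)
        if allowed[i] and mult <= max_step and current_limit * mult <= hard_cap
    ]
    if safe_set and random.random() < epsilon:
        return random.choice(safe_set)   # explore, but only inside the safe set
    return greedy_action                 # otherwise exploit
```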
Off‑policy evaluation (OPE)
- How will you evaluate candidate policies before online deployment, including sequential and bandit cases?
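For the bandit case, a minimal sketch of a self-normalized inverse propensity scoring (SNIPS) estimate of a candidate policy's value from logged data; doubly robust estimators add a reward model on top of the same weights, and in the sequential case fitted Q evaluation plays the analogous role. The weight clip is an assumed hyperparameter.

```python
import numpy as np

def snips_value(rewards, logging_propensities, target_action_probs,
                clip: float = 10.0) -> float:
    """Self-normalized IPS estimate of a candidate policy's average reward.

    target_action_probs[i] is the candidate policy's probability of the action
    that was actually logged in context i (1 or 0 for a deterministic policy)."""
    w = np.asarray(target_action_probs) / np.asarray(logging_propensities)
    w = np.clip(w, 0.0, clip)            # cap extreme importance weights
    denom = w.sum()
    return float((w * np.asarray(rewards)).sum() / denom) if denom > 0 else 0.0
```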
Safety guardrails
- Define policy- and system‑level controls that prevent harmful actions and enable safe rollout.
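A sketch of a system-level guardrail that sits between the policy and execution: it clamps the proposed limit to an absolute cap, bounds the change per review cycle, and falls back to a conservative rules-based limit when inputs look anomalous. All thresholds and the fallback value are placeholders.

```python
def apply_guardrails(proposed_limit: float, current_limit: float, fraud_score: float,
                     hard_cap: float = 50_000.0, max_rel_change: float = 0.25,
                     fallback_limit: float = 1_000.0) -> float:
    """Post-policy safety layer: the RL policy proposes, this layer disposes."""
    # Anomalous inputs -> conservative rules-based fallback.
    if not (0.0 <= fraud_score <= 1.0) or proposed_limit <= 0:
        return min(fallback_limit, current_limit)
    # Bound the relative change per review cycle.
    lo = current_limit * (1.0 - max_rel_change)
    hi = current_limit * (1.0 + max_rel_change)
    limit = min(max(proposed_limit, lo), hi)
    # No increases at very high fraud risk, and always respect the hard cap.
    if fraud_score > 0.9:
        limit = min(limit, current_limit)
    return min(limit, hard_cap)
```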
Cold start
- How will you handle new users or merchants with little or no history?
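One common cold-start pattern is to start from a conservative segment-level default and shrink toward the user's own observed behavior as history accumulates, in an empirical-Bayes style. The shrinkage constant and headroom factor below are illustrative.

```python
def cold_start_limit(segment_default_spend: float, observed_monthly_spend: float,
                     months_observed: int, shrinkage_months: float = 6.0,
                     headroom: float = 1.5) -> float:
    """Blend a segment prior with observed spend; weight shifts to the user's
    own data as evidence accumulates. Constants are assumptions for the sketch."""
    w = months_observed / (months_observed + shrinkage_months)
    blended = w * observed_monthly_spend + (1.0 - w) * segment_default_spend
    return headroom * blended
```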
Non‑stationarity
- How will you detect and adapt to distribution shifts (seasonality, new fraud patterns, macro shocks)?
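A minimal drift signal for this purpose is the population stability index (PSI) between a reference window and a recent window of a score or feature distribution; values above roughly 0.2 are a commonly quoted rule-of-thumb trigger for investigation. The binning and threshold below are assumptions.

```python
import numpy as np

def population_stability_index(reference, recent, n_bins: int = 10) -> float:
    """PSI between two samples; larger values indicate stronger distribution shift."""
    edges = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf       # catch values outside the reference range
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    new_frac = np.histogram(recent, bins=edges)[0] / len(recent)
    ref_frac = np.clip(ref_frac, 1e-6, None)    # avoid log(0) and division by zero
    new_frac = np.clip(new_frac, 1e-6, None)
    return float(np.sum((new_frac - ref_frac) * np.log(new_frac / ref_frac)))

# Example alerting rule (threshold is a rule of thumb; tune per metric):
# if population_stability_index(train_scores, live_scores) > 0.2: flag for retraining review
```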
Deployment
- Outline a cautious rollout plan and real‑time monitoring for this RL system.
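To illustrate one piece of such a rollout, the sketch below gates a canary cohort against a control cohort on guardrail metrics (realized loss rate and a friction proxy) and signals rollback when the gap exceeds a tolerance. Metric choices, thresholds, and cohort plumbing are placeholders; a production gate would also account for statistical uncertainty.

```python
from dataclasses import dataclass

@dataclass
class CohortStats:
    exposure: float    # total spend volume in the cohort
    losses: float      # realized fraud/credit losses
    complaints: int    # user-friction proxy, e.g., support contacts

def canary_gate(canary: CohortStats, control: CohortStats,
                max_loss_rate_gap: float = 0.002,
                max_complaint_ratio: float = 1.5) -> bool:
    """Return True if the canary may keep ramping, False to trigger rollback.
    Thresholds are placeholders chosen for illustration only."""
    canary_loss_rate = canary.losses / max(canary.exposure, 1.0)
    control_loss_rate = control.losses / max(control.exposure, 1.0)
    if canary_loss_rate - control_loss_rate > max_loss_rate_gap:
        return False
    if control.complaints and canary.complaints / control.complaints > max_complaint_ratio:
        return False
    return True
```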