This question evaluates a data scientist's ability to design product success metrics and a rigorous experiment plan, covering KPI formulation, guardrail metrics and thresholds, randomization and experiment design, logging instrumentation, and quasi-experimental causal inference.

You own 'euro-chat', a B2C customer-support chatbot that aims to deflect agent contacts while preserving customer satisfaction. Design a rigorous success framework and test plan:

1) Propose a single primary KPI at the conversation level that balances containment (no handoff) and customer utility. Write its exact formula and define the eligible population, the attribution window, and how to treat silent/abandoned chats and recontacts within 72 hours.
2) List at least three guardrail metrics (e.g., CSAT/NPS, refund/return rate, recontact rate) with thresholds, and explain the trade-offs if the primary KPI improves while a guardrail regresses.
3) Specify the unit of randomization (session vs. user vs. intent), the experiment design (A/B test or phased rollout), how you will handle novelty and learning effects, and the minimum required test duration under weekday/weekend seasonality.
4) Describe the instrumentation you need in logs (intents, escalation reason, intent confidence, user authentication status, handoff outcome) and a data-quality plan to detect labeling drift.
5) If randomization is infeasible, outline a quasi-experiment (e.g., difference-in-differences with matched stores) and list the assumptions you must validate to claim causality.
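To make part 1 concrete, here is a minimal sketch of one reasonable KPI formulation (not the only acceptable answer): the share of eligible conversations that are contained (no handoff) and not followed by a recontact from the same user within 72 hours. The column names (`user_id`, `start_ts`, `handoff`, `abandoned`) are hypothetical log fields assumed for illustration.

```python
from datetime import timedelta

import pandas as pd


def contained_resolution_rate(conv: pd.DataFrame) -> float:
    """One candidate primary KPI: fraction of eligible conversations that
    were contained (no agent handoff) AND not followed by a recontact from
    the same user within a 72-hour attribution window.

    Assumed (hypothetical) columns:
      user_id, start_ts (datetime), handoff (bool), abandoned (bool)
    Eligibility rule in this sketch: silent/abandoned chats are excluded
    from the denominator.
    """
    conv = conv.sort_values(["user_id", "start_ts"]).copy()
    eligible = conv[~conv["abandoned"]].copy()
    # Recontact: the same user's next conversation starts within 72 hours.
    # NaT (no next conversation) compares False, i.e. no recontact.
    next_start = eligible.groupby("user_id")["start_ts"].shift(-1)
    recontact_72h = (next_start - eligible["start_ts"]) <= timedelta(hours=72)
    success = (~eligible["handoff"]) & (~recontact_72h)
    return float(success.mean())
```

A candidate answer should also state why recontact is charged to the earlier conversation (it penalizes "fake containment" where the bot ends the chat without actually resolving the issue).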
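For part 3, the minimum duration question is usually answered with a two-proportion power calculation rounded up to whole weeks so both arms see the same weekday/weekend mix. A sketch, with all numeric inputs (baseline rate, minimum detectable effect, daily traffic) as illustrative assumptions:

```python
import math


def min_test_days(p_baseline: float, mde_abs: float, daily_units: int) -> int:
    """Sample-size sketch for a two-sided two-proportion z-test at
    alpha=0.05 with 80% power, then rounded UP to whole weeks so
    weekday/weekend seasonality is balanced across arms."""
    z_a, z_b = 1.96, 0.84  # z-values for alpha=0.05 (two-sided), power=0.80
    p1, p2 = p_baseline, p_baseline + mde_abs
    p_bar = (p1 + p2) / 2
    n_per_arm = (
        (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
         + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
        / mde_abs ** 2
    )
    # Both arms split the daily eligible traffic 50/50.
    days = math.ceil(2 * n_per_arm / daily_units)
    return math.ceil(days / 7) * 7  # full weeks only
```

For example, detecting a +2pp lift on a 60% containment baseline with 5,000 eligible conversations per day needs roughly 9,300 conversations per arm, which fits in under a week of traffic but is still rounded to 7 days; a strong answer would additionally extend the window (or add a burn-in) to let novelty and learning effects wash out.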
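For the data-quality plan in part 4, one common labeling-drift detector is the Population Stability Index (PSI) over the intent-label distribution, comparing a recent window against a frozen baseline. A minimal sketch; the intent names in the test are illustrative, and the 0.2 alert threshold is a common rule of thumb rather than a universal standard:

```python
import math
from collections import Counter


def intent_psi(baseline: list[str], current: list[str], eps: float = 1e-6) -> float:
    """Population Stability Index between two samples of intent labels.
    Rule of thumb: PSI > 0.2 suggests meaningful drift worth investigating
    (e.g., a retrained classifier silently relabeling traffic).
    eps guards against log(0) for categories absent from one sample."""
    cats = set(baseline) | set(current)
    b, c = Counter(baseline), Counter(current)
    nb, nc = len(baseline), len(current)
    psi = 0.0
    for cat in cats:
        pb = b[cat] / nb + eps
        pc = c[cat] / nc + eps
        psi += (pc - pb) * math.log(pc / pb)
    return psi
```

In practice this check would run daily per intent taxonomy version, alongside schema validation for the other logged fields (escalation reason, intent confidence, authentication status, handoff outcome).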
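For part 5, the simplest difference-in-differences estimator on store-period aggregates can be written in a few lines. This is a sketch with hypothetical column names; a real answer must also describe validating parallel pre-trends, no spillovers between matched stores, stable store composition, and no concurrent store-specific shocks:

```python
import pandas as pd


def did_estimate(df: pd.DataFrame) -> float:
    """2x2 difference-in-differences on store-period aggregates.

    Assumed (hypothetical) columns:
      treated (bool): store received the chatbot
      post (bool): period is after launch
      y (float): outcome, e.g. agent contacts per 1,000 orders
    Returns (treated post - pre) minus (control post - pre).
    """
    m = df.groupby(["treated", "post"])["y"].mean()
    return float(
        (m.loc[(True, True)] - m.loc[(True, False)])
        - (m.loc[(False, True)] - m.loc[(False, False)])
    )
```

The control stores' post-minus-pre change estimates the counterfactual trend; subtracting it from the treated stores' change isolates the chatbot's effect only if the parallel-trends assumption holds.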