Design email to avoid Promotions without online tests
Company: Tencent
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
You must finalize the design of an in‑game transactional email before any send. Goal: minimize the probability mailbox providers (e.g., Gmail/Outlook) classify it as Promotional/Spam. Constraint: you cannot run any online A/B tests or get post‑send user feedback; the decision must be made entirely offline.

You can use historical data from the last 12 months of game emails that includes: send_id, subject, body_html, num_links, link_domains (e.g., game.com, store.game.com, help.game.com, partner.com), anchor_text_types (CTA vs neutral), num_images, template_id, sender_reputation_metrics (complaint_rate_7d, bounce_rate_7d, domain_dkim_pass), send_time_utc, segment (region, platform), provider (gmail/outlook/yahoo), and label from seed inboxing tests or logs: folder_label in {primary, promotions, spam}.

Design variables you control now: number of links (1–5), which domains are linked, anchor text style (CTA vs neutral), subject tokens, presence of hero image.

a) Formulate an offline risk‑minimization problem to choose the design (decision vector) that minimizes P(Promotions/Spam) subject to constraints (e.g., must include at least one help link; subject length ≤ 60; no partner.com link if risk > threshold). Write the objective, constraints, and any robustness term you would include (e.g., worst‑case over providers or conformal upper bounds).
b) Specify features and a modeling approach to estimate risk, including how you will avoid leakage (e.g., using time‑based splits, template_id handling) and calibrate probabilities.
c) Describe how you will address covariate shift between historical templates and the proposed new design (e.g., domain adaptation, monotonic constraints on num_links, or semi‑synthetic data generation).
d) Propose a search/optimization strategy over the discrete design space (e.g., beam search with learned surrogate, Bayesian optimization with mixed variables, or ILP with learned risk).
e) Explain how you will validate the chosen design offline without any new user feedback (e.g., off‑policy evaluation with inverse propensity/importance weighting given historic sending policies, stratified provider‑wise risk, and conformal prediction intervals).
f) If historical labels are scarce or noisy for some providers, propose a fallback (e.g., weak labeling via open‑source classifier plus small hand‑labeled set) and how you would quantify added uncertainty in the final decision.
g) Deliver a concrete decision rule (e.g., select the lowest‑risk design whose worst‑case provider risk upper bound at 90% confidence is below X%) and justify your chosen X.
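
As a starting point for parts (a) and (g), one possible way to write the problem down is sketched below. The notation is an assumption introduced here for illustration, not part of the prompt: d is a design in the feasible set D, P is the set of providers, r̂_p(d) is a calibrated estimate of P(Promotions/Spam) for provider p, U_p^{1-α}(d) is an upper confidence bound on that risk (for example a conformal bound with 1-α = 0.90), X is the acceptance threshold from part (g), and τ is a stricter threshold applied when partner.com is linked.

```latex
\begin{aligned}
d^\star \in \arg\min_{d \in \mathcal{D}} \;& \max_{p \in \mathcal{P}} \hat{r}_p(d) \\
\text{s.t.}\;\;
& \max_{p \in \mathcal{P}} U_p^{1-\alpha}(d) < X, \\
& \texttt{help.game.com} \in \mathrm{domains}(d), \\
& \mathrm{len}\bigl(\mathrm{subject}(d)\bigr) \le 60, \\
& 1 \le \mathrm{num\_links}(d) \le 5, \\
& \texttt{partner.com} \in \mathrm{domains}(d) \;\Rightarrow\; \max_{p \in \mathcal{P}} U_p^{1-\alpha}(d) \le \tau .
\end{aligned}
```

The max over providers is the robustness term: a design is only as good as its worst provider. Other choices, such as a traffic-weighted average risk with a per-provider cap, would fit the prompt equally well.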
Quick Answer: This question evaluates a data scientist's competency in offline risk‑minimization for email deliverability, covering probabilistic risk estimation, robust optimization, calibration, and covariate‑shift handling within a Machine Learning framework.
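
The decision rule in part (g) can be made concrete with a small exhaustive search, since the design space is small enough to enumerate. The sketch below is illustrative only: `Design`, `enumerate_designs`, `select_design`, `toy_risk`, the promo token list, and every coefficient in `toy_risk` are hypothetical placeholders. In practice `toy_risk` would be replaced by a calibrated model that returns a point estimate and a 90% upper bound per provider.

```python
from dataclasses import dataclass
from itertools import combinations, product

PROVIDERS = ("gmail", "outlook", "yahoo")
DOMAINS = ("game.com", "store.game.com", "help.game.com", "partner.com")
PROMO_TOKENS = ("offer", "free", "sale")  # hypothetical token list
PROVIDER_OFFSET = {"gmail": 0.03, "outlook": 0.01, "yahoo": 0.0}  # invented


@dataclass(frozen=True)
class Design:
    num_links: int
    domains: frozenset
    anchor: str  # "cta" or "neutral"
    hero_image: bool
    subject: str


def enumerate_designs(subjects):
    """Yield designs that satisfy the hard constraints from part (a)."""
    for n in range(1, 6):  # number of links: 1..5
        for k in range(1, min(n, len(DOMAINS)) + 1):
            for doms in combinations(DOMAINS, k):
                if "help.game.com" not in doms:  # must include a help link
                    continue
                for anchor, hero, subj in product(
                    ("cta", "neutral"), (False, True), subjects
                ):
                    if len(subj) <= 60:  # subject length constraint
                        yield Design(n, frozenset(doms), anchor, hero, subj)


def toy_risk(design, provider):
    """Placeholder (point, upper) risk. All coefficients are made up."""
    promo = any(tok in design.subject.lower() for tok in PROMO_TOKENS)
    point = (
        0.02
        + 0.03 * (design.num_links - 1)
        + 0.05 * (design.anchor == "cta")
        + 0.04 * design.hero_image
        + 0.10 * ("partner.com" in design.domains)
        + 0.06 * ("store.game.com" in design.domains)
        + 0.05 * promo
        + PROVIDER_OFFSET[provider]
    )
    return point, point + 0.05  # fixed margin stands in for a real bound


def select_design(candidates, risk_fn, x, partner_x):
    """Pick the lowest worst-case point risk among designs whose worst-case
    upper bound is below x (or below partner_x if partner.com is linked)."""
    best = None
    for d in candidates:
        estimates = [risk_fn(d, p) for p in PROVIDERS]
        worst_point = max(pt for pt, _ in estimates)
        worst_upper = max(ub for _, ub in estimates)
        limit = partner_x if "partner.com" in d.domains else x
        if worst_upper >= limit:
            continue  # fails the robustness constraint
        if best is None or worst_point < best[1]:
            best = (d, worst_point, worst_upper)
    return best  # None means no design is safe enough to send


subjects = ["Your purchase receipt", "Limited offer: claim your reward now"]
choice = select_design(enumerate_designs(subjects), toy_risk, x=0.15, partner_x=0.10)
print(choice)
```

Returning `None` when no design clears the threshold is deliberate: with no online feedback available, declining to send (or escalating for review) is a valid outcome of the rule.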