Use data to resolve an ambiguous problem
Company: Capital One
Role: Data Scientist
Category: Behavioral & Leadership
Difficulty: hard
Interview Round: Technical Screen
Tell me about a time you used data to solve a poorly defined business problem end-to-end. Specify the exact hypothesis, the data sources and their known biases, your statistical method or model selection, and how you validated assumptions. Quantify impact with a defensible counterfactual (e.g., difference-in-differences vs. naive before–after). Include how you handled data quality issues, communicated uncertainty to stakeholders, and one thing you would improve if you had 10% more data.
Quick Answer: This question evaluates a data scientist's competency in end-to-end data analysis, causal identification, statistical modeling, data quality and governance, bias assessment, and communication of uncertainty when addressing an ambiguous business problem.
Solution
# Example answer (STAR + causal rigor)
## Situation (poorly defined problem)
Delinquencies on a revolving credit product had been creeping up over several months. Product leadership asked, "Can we nudge customers to enroll in AutoPay to reduce missed payments without increasing risk or harming customer experience?" The problem was ambiguous: multiple levers (UI, timing, incentives), unclear target segment, and macro seasonality confounded before–after comparisons.
## Hypothesis (testable)
Showing an in‑app prompt at the bill‑view moment to eligible customers will increase AutoPay enrollment within 14 days and reduce 60‑day delinquency, without adverse effects on charge‑off or complaints.
Primary outcome: AutoPay enrollment within 14 days. Secondary outcomes: 60‑day delinquency (DPD60), charge‑off rate, customer complaints within 30 days.
## Data sources and known biases
- App event logs (screen views, prompt impressions/clicks)
  - Biases: app‑only users differ from web/phone users (selection bias); session‑level duplication; timezone drift.
- Payments and statement data (due dates, posted payments, late fees)
  - Biases: end‑of‑month seasonality; partial payments; cutoff timing.
- Risk and bureau attributes (internal risk score, external score bands)
  - Biases: refresh cadence; missingness for new accounts; regulatory constraints on usage.
- Customer profile and eligibility flags (AutoPay availability, account tenure)
  - Biases: survivorship bias (closed accounts absent); eligibility changes mid‑experiment.
- Marketing exposures (email/SMS pushes)
  - Biases: cross‑channel interference; incomplete logging on some legacy campaigns.
Mitigations included stratified randomization, exposure logging, timezone normalization, and pre‑registration of metrics/windows.
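To make the stratified‑randomization mitigation concrete, here is a minimal sketch; the strata columns (risk_band, due_week) are illustrative, not the production schema:

```python
# Stratified (blocked) randomization: shuffle once, then alternate
# assignment within each stratum so every risk_band x due_week cell
# splits as close to 50/50 as possible.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000
customers = pd.DataFrame({
    "customer_id": np.arange(n),
    "risk_band": rng.choice(["low", "med", "high"], n),
    "due_week": rng.choice([1, 2, 3, 4], n),
})

assigned = customers.sample(frac=1, random_state=42).copy()
assigned["treat"] = assigned.groupby(["risk_band", "due_week"]).cumcount() % 2

# Each stratum's treatment share should sit at ~0.50 by construction.
print(assigned.groupby(["risk_band", "due_week"])["treat"].mean().round(2))
```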
## Method and design
- Causal strategy: randomized controlled experiment for the prompt; difference‑in‑differences (DiD) for delinquency to adjust for temporal shocks.
- Unit of randomization: customer (not session) to avoid spillover across sessions; 50/50 split.
- Stratification variables: risk band, tenure, region, due‑date week to improve balance and power.
- Sample: 200k eligible customers (100k treatment, 100k control) over one billing cycle; a power analysis (sketched below) targeted a minimum detectable effect of 1.0 percentage point on AutoPay enrollment with 90% power.
- Estimation:
  - AutoPay: difference in proportions (intent‑to‑treat), with stratification covariates in a logistic regression for precision.
  - DPD60: two‑period DiD using a customer‑level panel (pre = prior 2 months, post = experiment month), estimated via OLS with customer and time fixed effects and cluster‑robust SEs by customer (see the regression sketch below).
  - Heterogeneous effects (for rollout targeting): uplift modeling via causal forests on treatment × features.

Why not naive before–after? Seasonality and macro trends (tax season) materially shift delinquency; DiD adjusts for them via a contemporaneous control trend.
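A quick power check for this design, assuming the ~24% control enrollment rate reported in the results; statsmodels' normal‑approximation solver stands in for whatever internal tooling would actually be used:

```python
# Minimal power check: two-proportion test, 24% baseline, 1.0 pp MDE,
# 90% power (figures from the design above), Cohen's h effect size.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.24                 # control AutoPay enrollment rate
mde = 0.010                     # minimum detectable effect: 1.0 pp
effect = proportion_effectsize(baseline + mde, baseline)

n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.90, ratio=1.0
)
print(f"required n per arm: {n_per_arm:,.0f}")  # ~20k, so 100k/arm is ample
```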
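And a sketch of the DPD60 estimator on simulated data. For a two‑period, two‑group design, the treat × post interaction coefficient below equals the DiD estimate from the fixed‑effects specification; column names are illustrative:

```python
# Two-period DiD with cluster-robust SEs by customer, on a simulated
# panel that embeds the reported effects: 5% base DPD60, +0.6 pp macro
# shock in the post period, -0.8 pp treatment effect in the post period.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 100_000
customers = pd.DataFrame({
    "customer_id": np.arange(n),
    "treat": rng.integers(0, 2, n),   # randomized at the customer level
})
panel = customers.merge(pd.DataFrame({"post": [0, 1]}), how="cross")

p = 0.05 + 0.006 * panel["post"] - 0.008 * panel["treat"] * panel["post"]
panel["dpd60"] = rng.binomial(1, p)

did = smf.ols("dpd60 ~ treat * post", data=panel).fit(
    cov_type="cluster", cov_kwds={"groups": panel["customer_id"]}
)
# treat:post recovers roughly -0.008, simulation noise aside.
print(did.params["treat:post"], did.bse["treat:post"])
```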
## Assumption checks and validation
- Randomization balance: standardized mean differences (SMDs) across 20+ covariates all < 0.05 (see the balance‑check sketch after this list).
- SUTVA/interference: holdout users were never shown the prompt; cross‑device exposures monitored; contamination < 1%.
- Parallel trends (DiD): pre‑period DPD60 trends (3 months) had slope differences statistically indistinguishable from zero; placebo DiD on pre‑period months showed no effect.
- Overlap: all risk bands represented in both groups due to stratification.
- Model diagnostics: logistic model well‑calibrated (Brier score 0.14), no high VIFs; causal forest out‑of‑bag uplift AUC = 0.62.
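A minimal version of that balance check on simulated data; the covariate names and distributions are illustrative:

```python
# Standardized mean difference (SMD) between arms for each covariate;
# randomization should keep every |SMD| below the 0.05 threshold above.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "treat": rng.integers(0, 2, 10_000),
    "risk_score": rng.normal(650, 50, 10_000),      # illustrative covariate
    "tenure_months": rng.gamma(2.0, 24.0, 10_000),  # illustrative covariate
})

def smd(data: pd.DataFrame, covariate: str, group: str = "treat") -> float:
    """Difference in means divided by the pooled standard deviation."""
    t = data.loc[data[group] == 1, covariate]
    c = data.loc[data[group] == 0, covariate]
    pooled_sd = np.sqrt((t.var(ddof=1) + c.var(ddof=1)) / 2)
    return (t.mean() - c.mean()) / pooled_sd

for cov in ["risk_score", "tenure_months"]:
    print(cov, round(smd(df, cov), 4))
```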
## Data quality issues and fixes
- User ID stitching: app and core systems had rare ID mismatches. Fixed with deterministic keys plus fuzzy match fallback, then dedup; added unit tests to ensure 1:1 mapping.
- Timezones/cutoffs: unified all event timestamps to UTC and aligned them to statement cutoff windows; precomputed daily snapshots to avoid late‑arriving data bias (see the sketch after this list).
- Missing bureau scores: 6% missing; imputed with "missing" category for modeling (not for treatment assignment) and ran sensitivity excluding these users—effects consistent.
- Exposure logging gaps: legacy email campaigns could confound results; we paused overlapping campaigns for the sample and logged any exceptions.
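A sketch of the timezone normalization, assuming each event carries a recorded local timezone; the timestamps, zones, and midnight‑UTC cutoff are all illustrative:

```python
# Localize each event to its recorded timezone, convert to UTC, then
# bucket by billing day so the same event lands in the same statement
# window regardless of the user's timezone.
import pandas as pd

events = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_ts": ["2023-03-01 23:50", "2023-03-01 23:55", "2023-03-02 00:10"],
    "local_tz": ["US/Eastern", "US/Eastern", "US/Pacific"],
})

events["ts_utc"] = pd.to_datetime(
    [pd.Timestamp(ts).tz_localize(tz)
     for ts, tz in zip(events["event_ts"], events["local_tz"])],
    utc=True,
)
events["billing_day"] = events["ts_utc"].dt.floor("D")  # midnight-UTC cutoff
print(events[["customer_id", "ts_utc", "billing_day"]])
```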
## Results with a defensible counterfactual
- AutoPay enrollment (14‑day):
- Treatment: 27.1% (27,100/100,000)
- Control: 24.0% (24,000/100,000)
- Uplift: +3.1 percentage points (pp)
- 95% CI: ±0.4 pp (SE ≈ 0.20 pp), p < 1e‑10
- DPD60 (difference‑in‑differences):
- Pre DPD60: both groups ≈ 5.0%
- Post DPD60: control 5.6%, treatment 4.8%
- DiD estimate τ̂ = (4.8 − 5.0) − (5.6 − 5.0) = −0.8 pp
- 95% CI: [−1.1 pp, −0.5 pp] (cluster‑robust)
Why DiD matters: A naive before–after on treatment alone would suggest −0.2 pp, missing the macro uptick shown by the control (+0.6 pp). DiD recovers the causal effect of −0.8 pp.
Back‑of‑the‑envelope impact (illustrative, using expected credit loss ECL = PD × LGD × EAD):
- Eligible monthly population at rollout: 2.0M customers.
- Fewer DPD60 accounts: 0.8% × 2.0M = 16,000 per month.
- Assumptions: conditional charge‑off probability 30%; LGD 80%; average EAD $1,500.
- Monthly ECL reduction ≈ 16,000 × 0.30 × 0.80 × $1,500 = $5.76M; annualized ≈ $69M.
- Late fee revenue loss from fewer late payments: ≈ $0.7M/month (based on historical fee incidence), net ≈ $5.1M/month.
Sensitivity: Varying charge‑off probability ±5 pp changes the annualized benefit by ≈ ±$11M; we presented a tornado chart covering key drivers.
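Restating the arithmetic as code so the tornado‑chart inputs are explicit; the charge‑off probability, LGD, and EAD are the stated assumptions, not measured values:

```python
# Back-of-the-envelope ECL impact (ECL = PD x LGD x EAD), using the
# assumptions stated above; vary p_chargeoff for one tornado-chart arm.
eligible = 2_000_000      # monthly eligible population at rollout
did_effect = 0.008        # -0.8 pp DPD60 reduction (DiD estimate)
p_chargeoff = 0.30        # conditional charge-off probability (assumption)
lgd = 0.80                # loss given default (assumption)
ead = 1_500               # average exposure at default, USD (assumption)
late_fee_loss = 0.7e6     # monthly late-fee revenue foregone (historical)

fewer_dpd60 = did_effect * eligible                  # 16,000 accounts/month
ecl_saved = fewer_dpd60 * p_chargeoff * lgd * ead    # ~$5.76M/month
print(f"net monthly benefit: ${(ecl_saved - late_fee_loss) / 1e6:.2f}M")

for p in (0.25, 0.30, 0.35):   # +/- 5 pp on charge-off probability
    print(p, f"${fewer_dpd60 * p * lgd * ead * 12 / 1e6:.1f}M annualized")
```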
## Communication of uncertainty and decision
- Reported point estimates with 95% CIs and MDEs; emphasized intent‑to‑treat estimates to reflect real‑world adherence.
- Framed trade‑offs: customer benefit (fewer fees), risk reduction, and revenue impact; aligned with compliance and customer fairness.
- Decision: full rollout to all eligible users, with targeted prioritization based on uplift model for high‑impact segments; guardrails on complaint rate and any adverse risk shifts.
## One improvement with +10% more data
- Improve heterogeneity estimation: With 10% more users, we could more precisely learn segment‑level uplift (especially in thin strata like new‑to‑credit), enabling a tighter treatment policy (e.g., top decile uplift only), which simulations suggest would add ≈ 10–15% incremental ECL reduction for the same exposure volume.
## Key formulas (for clarity)
- Difference‑in‑Differences: τ̂ = (Ȳ_treat,post − Ȳ_treat,pre) − (Ȳ_ctrl,post − Ȳ_ctrl,pre)
- Intent‑to‑Treat uplift on AutoPay: Δ = p_treat − p_ctrl; SE(Δ) ≈ sqrt[p̂(1 − p̂)(1/n_treat + 1/n_ctrl)], where p̂ is the pooled proportion
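Both formulas check out numerically against the figures reported above:

```python
# Reproduce the reported uplift CI and the DiD point estimate.
import math

n_t = n_c = 100_000
p_t, p_c = 0.271, 0.240
p_pool = (p_t * n_t + p_c * n_c) / (n_t + n_c)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
print(f"uplift {p_t - p_c:.3f}, 95% CI +/- {1.96 * se:.4f}")  # ~ +/-0.004

tau = (0.048 - 0.050) - (0.056 - 0.050)   # DiD on the reported group means
print(f"tau_hat = {tau:+.3f} (i.e., -0.8 pp)")
```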
## Pitfalls and how we avoided them
- Naive before–after: used DiD and pre‑trend checks.
- Contamination: enforced holdout, monitored cross‑channel exposures.
- P‑hacking: pre‑registered metrics and windows; no unadjusted peeking at interim results.
- Overfitting uplift: out‑of‑bag validation and monotonicity checks; conservative thresholds for rollout.