Describe influencing without authority
Company: Meta
Role: Data Scientist
Category: Behavioral & Leadership
Difficulty: medium
Interview Round: Onsite
Tell me about a time you disagreed with a senior engineer’s proposed design, but had no direct authority to change it. In your answer, use the STAR framework and include: (a) the concrete stakes, timeline, and measurable risks; (b) how you built alignment (stakeholder map, 1:1s, data/experiments used, decision docs); (c) the specific trade-offs you proposed and how you quantified impact (latency, reliability, cost); (d) how you handled conflict professionally when challenged in a live design review; (e) the final decision and measurable outcomes; and (f) what you would do differently if, months later, you found your alternative caused a regression under unexpected load.
Quick Answer: This question evaluates a candidate's ability to influence without formal authority, covering stakeholder management, cross-functional communication, conflict resolution, and technical trade-off quantification within a Behavioral & Leadership interview for a Data Scientist role.
Solution
# STAR Answer With Quantification and Influence Tactics
This answer models how to respond without naming proprietary details, using a data-science-plus-platform example with concrete numbers, experiments, and stakeholder alignment.
## Situation
We were launching a near-real-time ranking improvement for notifications. The senior backend engineer proposed a synchronous, per-request fan-out to four microservices (social graph, embeddings, geo, and abuse signals) to enrich features at request time.
- Stake: Lift click-through rate (CTR) by +0.5 percentage points to hit a quarterly engagement target.
- SLA: End-to-end p95 latency ≤ 250 ms; availability ≥ 99.9%.
- Baseline: Existing pipeline p95 ≈ 120 ms.
- Timeline: 6 weeks to launch aligned to a marketing campaign.
- Risk: If p95 exceeds 250 ms or availability drops, we’d violate SLOs and risk suppressing notifications—a potential 1–2% drop in daily engaged sessions.
## Task
As the responsible data scientist for model performance and experiment design, I needed to ensure the design met latency/reliability SLOs and protected model quality, without owning the service architecture or team.
## Actions
1) Built Alignment and Gathered Evidence
- Stakeholder map:
  - Senior Backend Engineer (design owner)
  - EM (execution, SLO accountability)
  - PM (business outcomes)
  - SRE (reliability, on-call)
  - Privacy reviewer (data flows)
  - Partner notifications team (client integration)
- 1:1s: I met each stakeholder to understand constraints, explicitly asking SRE for historical p95 and availability per dependency, and PM for the business sensitivity of latency vs CTR.
- Data deep-dive:
  - Latency tail risk: Pulled p95/p99 latency for each candidate dependency; each had p95 ≈ 60–90 ms. In a parallel fan-out, end-to-end latency behaves like the max of the calls, so you pay the worst tail.
  - Reliability math: If each service is ~99.5% available, a synchronous fan-out to 4 services yields availability ≈ 0.995^4 = 0.980 (98.0%), i.e., ~2% of requests fail before retries/timeouts. That alone would blow through our 99.9% target.
  - Feature freshness analysis: 92% of features changed no more often than every 30 minutes; only a small subset (<8%) were highly dynamic.
- Prototyping and experiments:
  - Load test (Locust/k6) against a mock fan-out: at 2k QPS, end-to-end p95 rose to ~290 ms with retries; error rate was ~1.7% during dependency p95 spikes.
  - Shadow pipeline: I built a streaming materialization (Kafka + Flink) to precompute features into Redis with a 5-minute TTL; on request, we would only call an embeddings service for cold/missing features. Cache hit rate in shadow: ~95%.
- Decision doc: Wrote a 4-page decision doc with an options matrix (freshness, p95, p99, availability, cost, delivery risk), plus a rollout/rollback plan and SLIs/SLOs.
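The reliability math from the deep-dive is easy to sanity-check with a short sketch. The availabilities below are the illustrative figures from this example, not production data:

```python
# Availability of a synchronous fan-out to independent dependencies:
# P(all succeed) = product of the individual availabilities.
def fanout_availability(availabilities):
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# Four dependencies, each ~99.5% available (example figures from above).
avail = fanout_availability([0.995] * 4)
print(f"{avail:.3f}")  # ~0.980, i.e. ~2% of requests fail before retries
```

This is why adding dependencies to a synchronous path silently erodes availability: each one multiplies in, and four "three-and-a-half nines" services together cannot meet a 99.9% target.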
2) Proposed Quantified Trade-offs
- Design A (senior engineer's original): per-request fan-out to 4 services, parallel calls.
  - Latency: Measured p95 ≈ 290 ms (violates the 250 ms SLO); the tail is compounded by retries.
  - Availability: ≈ 0.995^4 = 0.980 (98%). Retries improve availability but push tail latency further past the SLO.
  - Cost: 8M req/day × 4 calls × $0.00001 per call ≈ $320/day → ~$9.6k/month (excl. egress).
- Design B (my proposal): streaming precompute + Redis feature store (5-minute TTL) with a single on-demand call for hot, fast-changing features (~5% of requests), and a degradation fallback if that call fails.
  - Latency: Cache fetch ≈ 3–5 ms; 5% of requests incur an extra ~65 ms. Weighted end-to-end p95 measured ≈ 205 ms.
  - Availability: Redis at 99.99% plus a rare on-demand call at 99.5% → effective success ≈ 0.95×0.9999 + 0.05×(0.9999×0.995) ≈ 99.97%.
  - Freshness impact: Offline AUC delta with 5-minute staleness ≈ −0.10%; online, we projected CTR impact within ±0.05 pp of target.
  - Cost: Redis ~$1.2k/month + 5% on-demand calls ≈ $16/day → ~$1.7k/month total, saving ~$7.9k/month vs Design A.
- Hybrid tweak: Dynamic TTLs (shorter for rapidly changing features, longer otherwise) + pre-warming caches for hot keys to boost hit rate from 95% to ~97%.
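The options-matrix numbers can be reproduced with a quick back-of-envelope script. Per-call cost, availabilities, and hit rate are the illustrative figures from this example; validate any real comparison against infra dashboards:

```python
# Back-of-envelope comparison of the two designs (illustrative figures).

REQ_PER_DAY = 8_000_000
COST_PER_CALL = 1e-5          # assumed $ per dependency call

# Design A: per-request parallel fan-out to 4 services.
cost_a_per_day = REQ_PER_DAY * 4 * COST_PER_CALL
avail_a = 0.995 ** 4          # four independent ~99.5% dependencies

# Design B: 95% cache hits on Redis (99.99%); 5% of requests also
# make one on-demand call (99.5%).
hit_rate = 0.95
avail_b = hit_rate * 0.9999 + (1 - hit_rate) * (0.9999 * 0.995)

print(f"Design A: ${cost_a_per_day:.0f}/day, availability {avail_a:.3f}")
print(f"Design B: availability {avail_b:.4f}")   # ~0.9997
```

The point of putting this in the decision doc is that the comparison becomes arithmetic rather than opinion: anyone can change an input and re-derive the conclusion.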
3) Navigated the Live Design Review Professionally
- I opened by restating the senior engineer’s goals and their design, validating the need for freshness.
- I framed the discussion around shared SLOs and launch risk, not personal preference.
- I showed side-by-side measurements and the simple reliability formula P(all succeed) = ∏(1 − p_i) to illustrate compounded risk.
- When challenged on freshness, I proposed a targeted A/B: route 10% traffic to dynamic TTLs for fast features only, with guardrails (kill-switch, error budget alerts, auto-rollback if p95 > 250 ms or failure rate > 0.5%).
- I invited SRE to comment on tail risk and PM to weigh business sensitivity to minor model staleness vs missed launch.
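The auto-rollback guardrail proposed in the review can be sketched as a simple threshold check. The thresholds are the ones stated above; the function name and the idea of wiring it to a rollback hook are hypothetical:

```python
# Guardrail check for the 10% A/B canary (thresholds from the review).
P95_LIMIT_MS = 250.0
FAILURE_RATE_LIMIT = 0.005   # 0.5%

def should_auto_rollback(p95_ms: float, failure_rate: float) -> bool:
    """Return True if either SLO guardrail is breached."""
    return p95_ms > P95_LIMIT_MS or failure_rate > FAILURE_RATE_LIMIT

# Healthy canary vs a latency breach:
print(should_auto_rollback(205.0, 0.001))  # False: within guardrails
print(should_auto_rollback(290.0, 0.001))  # True: p95 breach
```

Agreeing on a mechanical trigger like this before launch turns a potential mid-incident argument into a pre-committed decision.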
## Result
- Decision: We adopted the hybrid approach (streaming precompute + Redis + on-demand call only for fast-changing features, with dynamic TTLs and a circuit-breaker fallback).
- Delivery: Shipped in week 6 with a canary → region → global rollout.
- Outcomes (first 2 weeks):
  - p95 latency: 205 ms (down from 290 ms in the Design A test; within the 250 ms SLO).
  - Availability: 99.95% (vs projected 98% for the fan-out).
  - CTR: +0.52 pp lift vs control (target was +0.50 pp); notification hides unchanged.
  - Cost: ~$7.9k/month infra savings vs the fan-out estimate.
  - Operational: On-call pages reduced; no error-budget burn.
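The adopted hybrid path amounts to a read-through cache with a degradation fallback. A minimal sketch, using an in-memory dict as a stand-in for Redis and hypothetical callables for the embeddings service and batch features:

```python
import time

# In-memory stand-in for Redis: key -> (features, expiry_timestamp).
cache = {}
TTL_SECONDS = 300   # 5-minute TTL, as in the shadow pipeline

def fetch_features(key, on_demand_call, batch_fallback):
    """Read-through: cache hit, else one on-demand call, else batch features."""
    entry = cache.get(key)
    now = time.time()
    if entry and entry[1] > now:
        return entry[0]                      # ~95% of requests end here
    try:
        features = on_demand_call(key)       # single call for cold/fast keys
        cache[key] = (features, now + TTL_SECONDS)
        return features
    except Exception:
        return batch_fallback(key)           # degrade rather than fail

# Usage with toy backends (both callables are hypothetical):
fresh = fetch_features("user:42", lambda k: {"emb": [0.1]}, lambda k: {"emb": None})
print(fresh)  # {'emb': [0.1]}, now cached for subsequent requests
```

The key design choice is that a failed on-demand call degrades to stale batch features instead of failing the request, which is what keeps effective availability near the cache's rather than the dependency's.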
## If My Alternative Later Caused a Regression Under Unexpected Load
Assume three months later, a seasonal traffic spike doubled QPS, streaming lag increased, cache thrash reduced hit rates from 97% to 85%, and CTR dropped 0.3 pp temporarily.
- Immediate response (hours):
  - Trigger the kill-switch to increase TTLs and reduce on-demand calls.
  - Enable the circuit breaker to fall back to batch-only features for overloaded segments.
  - Rate-limit the lowest-value notification types to protect SLOs.
  - Canary a rollback to the more synchronous path for a small cohort to validate recovery.
- Short-term mitigations (days):
  - Autoscale stream processors; increase Kafka partitions and consumer parallelism; add backpressure.
  - Convert the cache policy from LRU to LFU with admission control to reduce churn; pre-warm hot keys.
  - Partition Redis by keyspace and increase memory to restore a >95% hit rate.
  - Tie dynamic TTLs to observed update rates and queue lag.
- Long-term hardening (weeks):
  - Capacity planning: run load tests at 2–3× peak with fault injection (latency spikes, partial outages).
  - SLOs and error budgets for freshness (e.g., % of features older than a threshold) in addition to latency/availability.
  - Feature-criticality routing: degrade gracefully by dropping the least-informative features first; maintain a robust heuristic fallback.
  - Rollout guardrails: canary by region, progressive traffic ramps, automatic rollback on SLO breach.
- Retrospective learning:
  - Blameless postmortem; document decisions and new runbooks.
  - Decision doc addendum: explicitly model cache hit-rate sensitivity and streaming lag; include a "holiday spike" scenario.
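The dynamic-TTL mitigation can be sketched as a small policy function keyed on observed update rate and stream lag. The thresholds and multipliers below are illustrative assumptions, not tuned values:

```python
# Dynamic TTL policy: shorter TTLs for fast-changing features, stretched
# TTLs under stream backpressure (all thresholds are illustrative).
def dynamic_ttl_seconds(updates_per_hour: float, stream_lag_s: float,
                        base_ttl_s: float = 300.0) -> float:
    if updates_per_hour > 12:        # changes more than every 5 minutes
        ttl = base_ttl_s / 4         # keep highly dynamic features fresh
    elif updates_per_hour > 2:
        ttl = base_ttl_s
    else:
        ttl = base_ttl_s * 4         # slow-moving features can go stale
    if stream_lag_s > 60:            # under backpressure, shed recompute load
        ttl *= 2
    return ttl

print(dynamic_ttl_seconds(20, 5))    # 75.0: fast feature, healthy stream
print(dynamic_ttl_seconds(1, 120))   # 2400.0: slow feature, lagging stream
```

Making the lag term explicit encodes the postmortem lesson directly: when the pipeline falls behind, the system trades freshness for stability automatically instead of thrashing the cache.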
## Why This Works (for Interviewers and Candidates)
- Uses STAR while quantifying stakes, latency, reliability, and cost.
- Demonstrates influence: stakeholder mapping, 1:1s, decision docs, and experiments.
- Shows professional conflict handling with data and shared goals.
- Provides guardrails (canary, kill-switches, SLOs) and a learning loop for regressions.
Key formulas and checks used:
- Tail in parallel fan-out is dominated by max(latency_i); p95 of max grows with number of dependencies.
- Availability for independent calls: P(success) = ∏(availability_i).
- Weighted latency with cache: expected fetch latency ≈ cache_latency × hit_rate + (cache_latency + call_latency) × miss_rate (a mean-style approximation, not a true p95; validate with measurement).
- Cost modeling: requests/day × calls/request × cost/call + fixed cache cost (validate with infra dashboards).
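The weighted-latency approximation can be checked numerically with the Design B figures (feature-fetch cost only, not the full end-to-end p95):

```python
# Expected feature-fetch latency under the cache (illustrative inputs).
cache_latency_ms = 4.0    # ~3-5 ms cache fetch
call_latency_ms = 65.0    # extra on-demand call on a miss
hit_rate = 0.95

approx = (cache_latency_ms * hit_rate
          + (cache_latency_ms + call_latency_ms) * (1 - hit_rate))
print(f"{approx:.2f} ms")  # 7.25 ms expected feature-fetch latency
```

As the formula's caveat notes, this is an expectation, and tail percentiles must still be measured: a 5% miss rate means the p95 can sit right at the boundary between the hit and miss populations.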