Explain your most impactful project end-to-end
Company: BitGo
Role: Software Engineer
Category: Behavioral & Leadership
Difficulty: medium
Interview Round: Technical Screen
Walk me through your most impactful recent project end-to-end: problem statement, stakeholders, success metrics, high-level architecture, key technical decisions and tradeoffs, your exact contributions, timeline, and results. Describe one major challenge, how you investigated it, your solution, and the measurable impact. If you could redo the project, what would you change and why?
Quick Answer: This question evaluates project leadership, end-to-end technical ownership, system-design reasoning, stakeholder management, and the ability to measure and communicate impact.
Solution
# How to structure your answer (template)
Aim for a concise, structured 5–7 minute story:
1) Problem and context
- What was broken or missing? Why now? Quantify the pain (latency, error rate, cost, incidents, growth).
2) Stakeholders
- Who depended on it (users, teams, security/compliance, SRE)? What did each care about?
3) Success metrics
- Define explicit targets (e.g., p95 latency < 250 ms, 99.99% availability, < 0.1% error rate, 3x throughput, -30% cost).
4) High-level architecture
- Name the main services, data stores, interfaces (gRPC/REST), eventing, caches, infra, observability.
5) Key decisions and tradeoffs
- Storage choice, sync vs async, consistency vs availability, schema/versioning, language/runtime, retries/idempotency, security.
6) Your contributions
- Design docs, core modules you built, migrations, reviews, experiments, on-call, rollout.
7) Timeline
- Phases with weeks/months and milestones (discovery, design, build, test, canary, GA).
8) Results
- Before vs after on the success metrics. Include business or user impact.
9) Major challenge
- Root cause discovery process, options considered, final solution, measurable impact after fix.
10) Retrospective
- What you would do differently and why (process, tech, rollout, testing, metrics).
---
## Example answer (SWE, fintech custody context)
1) Problem and context
- We needed to replace a monolithic transaction policy engine that evaluated approval rules (limits, geofencing, velocity, multi-approver workflows). It was a bottleneck during peak times.
- Pain points before: p95 latency ~2.4 s, 2–3 on-call pages/week from timeouts, throughput capped at ~600 TPS, rule changes required redeploys.
2) Stakeholders
- Wallet/transactions API team (needed low latency and reliability).
- Compliance and Risk (needed auditability, deterministic decisions, easy rule updates).
- SRE (needed 99.99% availability and manageable on-call).
- Security (HSM/KMS integration, least-privilege access, tamper-evident audit trail).
3) Success metrics
- p95 decision latency < 250 ms, p99 < 500 ms.
- 99.99% availability for the decision API.
- 2,000 TPS sustained with headroom to 3,000 TPS bursts.
- < 0.1% error rate; zero double-approvals.
- Deploy-free rule changes with < 5 min propagation.
4) High-level architecture
- gRPC decision service (stateless) reading policy snapshots from Postgres, with warm caching in Redis.
- Event-driven updates: policy changes published on Kafka; services subscribe and refresh caches atomically.
- Idempotency and audit layer: Postgres tables with unique constraints on (tenant_id, request_hash) and an append-only decisions log.
- Observability: OpenTelemetry traces, Prometheus metrics, centralized logs with correlation IDs.
- Security: KMS for secrets, signed policy bundles, least-privilege IAM, WAF + mTLS.
Data flow (simplified):
- Client -> Transactions API -> Decision gRPC -> Redis cache (hit) or Postgres (miss) -> Decision returned -> Append audit log.
- Policy Console -> Writes to Postgres -> Publishes Kafka event -> Decision service refreshes in-memory/Redis snapshot.
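A minimal sketch of that read path in TypeScript, assuming ioredis and node-postgres clients; the table names, cache keys, and the `evaluateRules` stub are illustrative rather than the actual service code:

```typescript
import Redis from "ioredis";
import { Pool } from "pg";

const redis = new Redis();
const pg = new Pool();

interface PolicySnapshot {
  version: number;
  rules: unknown[];
}

// Placeholder for the compiled rule-evaluation pipeline.
function evaluateRules(snapshot: PolicySnapshot, request: object): "APPROVE" | "DENY" {
  return "APPROVE";
}

async function loadSnapshot(tenantId: string): Promise<PolicySnapshot> {
  // Warm path: versioned policy snapshot cached in Redis.
  const cached = await redis.get(`policy:${tenantId}`);
  if (cached) return JSON.parse(cached);

  // Cold path: read the latest snapshot from Postgres, then backfill the cache.
  // (Assumes at least one snapshot exists for the tenant.)
  const { rows } = await pg.query(
    `SELECT version, rules FROM policy_snapshots
     WHERE tenant_id = $1 ORDER BY version DESC LIMIT 1`,
    [tenantId],
  );
  const snapshot: PolicySnapshot = rows[0];
  await redis.set(`policy:${tenantId}`, JSON.stringify(snapshot), "EX", 300);
  return snapshot;
}

async function decide(tenantId: string, request: object): Promise<"APPROVE" | "DENY"> {
  const snapshot = await loadSnapshot(tenantId);
  const decision = evaluateRules(snapshot, request);
  // Append-only audit record; correlation/trace IDs would ride along here.
  await pg.query(
    `INSERT INTO decision_log (tenant_id, policy_version, decision)
     VALUES ($1, $2, $3)`,
    [tenantId, snapshot.version, decision],
  );
  return decision;
}
```

In the real service, the gRPC handler would wrap `decide` with deadline propagation, tracing, and the idempotency check described in section 9.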
5) Key decisions and tradeoffs
- Sync gRPC for request/response decisions vs async events: chose sync for user-facing latency; complemented with async Kafka for cache invalidation.
- Relational store (Postgres) for policy and audit vs NoSQL: chose Postgres for strong consistency, transactions, and rich querying; used Redis for performance.
- Policy expression: adopted a CEL-like DSL instead of a custom rules engine to avoid lock-in and enable safe dynamic updates.
- Consistency vs availability: prioritized availability with at-least-once cache updates and strong constraints in the write path to prevent double approvals.
- Caching strategy: versioned policy snapshots with TTLs and ETags to avoid stale reads; fall back to the last-known-good snapshot under partial outages (sketched below).
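As a rough illustration of the last-known-good fallback, here is a hedged sketch of the cache-refresh consumer using kafkajs; the topic name, event shape, and `loadSnapshotVersion` stub are assumptions, not the production code:

```typescript
import { Kafka } from "kafkajs";

interface PolicySnapshot {
  version: number;
  rules: unknown[];
}

// Stand-in for a versioned snapshot read from Postgres.
async function loadSnapshotVersion(tenantId: string, version: number): Promise<PolicySnapshot> {
  return { version, rules: [] };
}

// Last-known-good snapshot per tenant; served when a refresh fails.
const snapshots = new Map<string, PolicySnapshot>();

async function runRefresher(): Promise<void> {
  const kafka = new Kafka({ brokers: ["localhost:9092"] });
  const consumer = kafka.consumer({ groupId: "decision-service" });
  await consumer.connect();
  await consumer.subscribe({ topic: "policy-changes" });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const { tenantId, version } = JSON.parse(message.value!.toString());
      try {
        const fresh = await loadSnapshotVersion(tenantId, version);
        const current = snapshots.get(tenantId);
        // At-least-once delivery can replay old events: only move forward.
        if (!current || fresh.version > current.version) {
          snapshots.set(tenantId, fresh);
        }
      } catch (err) {
        // Partial outage: keep serving the last-known-good snapshot
        // instead of failing reads.
        console.error("snapshot refresh failed; serving last-known-good", err);
      }
    },
  });
}
```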
6) My contributions
- Drove the design doc and ADRs; aligned stakeholders on SLIs/SLOs.
- Implemented the decision service: gRPC API, rule evaluation pipeline, Redis cache strategy, and idempotency/audit schema.
- Added OpenTelemetry tracing and golden signal dashboards; wrote load and chaos test harness.
- Led canary rollout and incident response playbooks; mentored two engineers through feature modules and code reviews.
7) Timeline
- Weeks 1–2: Discovery, baseline metrics, SLIs/SLOs.
- Weeks 3–4: Design doc and reviews; build vs buy evaluation; threat model.
- Weeks 5–9: Implementation; integration tests; staging load tests to 2,500 TPS.
- Week 10: Chaos tests (DB failover, cache eviction storms, network jitter).
- Weeks 11–12: Canary (10% -> 50% -> 100%) with auto-rollbacks.
- Week 13: GA and decommissioning of the legacy engine.
8) Results
- p95 latency improved from ~2.4 s to 180 ms; p99 to 420 ms.
- Availability from ~99.5% to 99.99%; on-call pages dropped 70%.
- Sustained 2,200 TPS with bursts to ~3,100 TPS; CPU usage -25% due to caching and compiled rule plans.
- Compliance: rule updates live within 2 minutes, no deploys; full decision audit trail with cryptographic hashes.
9) Major challenge: duplicate approvals during retries
- Symptom: During canary, we observed rare duplicate approvals in audit logs when upstream timeouts caused client retries.
- Investigation: Used trace IDs to correlate retries; reproduced the issue by injecting 502s and timeouts. Found that the legacy path used check-then-insert without atomicity, and our new path initially relied on app-level dedupe only.
- Solution:
- Introduced a mandatory Idempotency-Key per request; the server derives a request_hash from normalized inputs.
- Enforced a unique constraint on (tenant_id, request_hash) with an atomic UPSERT that returns the prior decision when present (sketched after this section).
- Tuned the retry policy with exponential backoff and jitter; set circuit-breaker thresholds; made the client retry-safe.
- Added de-dupe metrics and alerts; ran a one-time reconciliation job to flag historical dupes (none were user-visible).
- Impact: Duplicate approvals dropped to 0 over 4 weeks; error rate fell from 0.34% to 0.06%; eliminated a class of incidents tied to retries.
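A minimal sketch of the dedupe fix, assuming node-postgres; the `decisions` schema and the hashing details are illustrative. The no-op `DO UPDATE` lets `RETURNING` yield the existing row on conflict, and the Postgres `xmax = 0` idiom distinguishes a fresh insert from a retry hit:

```typescript
import { createHash } from "crypto";
import { Pool } from "pg";

const pool = new Pool();

// Normalize inputs so semantically identical retries hash identically.
// (Flat objects assumed for brevity; nested inputs need recursive
// canonicalization, since a replacer array also filters nested keys.)
function requestHash(input: Record<string, unknown>): string {
  const normalized = JSON.stringify(input, Object.keys(input).sort());
  return createHash("sha256").update(normalized).digest("hex");
}

async function recordDecision(
  tenantId: string,
  input: Record<string, unknown>,
  decision: "APPROVE" | "DENY",
): Promise<{ decision: string; duplicate: boolean }> {
  const hash = requestHash(input);
  const { rows } = await pool.query(
    `INSERT INTO decisions (tenant_id, request_hash, decision)
     VALUES ($1, $2, $3)
     ON CONFLICT (tenant_id, request_hash)
     DO UPDATE SET request_hash = EXCLUDED.request_hash
     RETURNING decision, (xmax = 0) AS inserted`,
    [tenantId, hash, decision],
  );
  // On a retried request, rows[0].decision is the originally stored outcome.
  return { decision: rows[0].decision, duplicate: !rows[0].inserted };
}
```

A client-supplied Idempotency-Key can feed directly into `request_hash`; either way, the unique constraint is what makes the guarantee hold under concurrent retries.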
10) Retrospective: what I'd change
- Introduce idempotency and unique constraints from day one, not after canary.
- Run failure-injection and load testing earlier (pre-merge) with contract tests for retry semantics.
- Adopt a standard policy engine (e.g., OPA/CEL) even sooner to avoid building DSL tooling.
- Define SLIs and error budgets before sprint 1 to drive tradeoffs faster.
---
## Guardrails and validation tips
- Quantify before/after with the same measurement method (cold vs warm cache, steady-state vs burst).
- Proactively call out error budgets, rollback criteria, and canary gates.
- Name one or two tradeoffs you intentionally did not optimize (e.g., chose availability over strict serializable transactions in the read path) and why.
- If you lack a production story, use a personal project or OSS contribution but still attach metrics (build times, test flakiness, perf delta, adoption).
Adapt this structure to any project: swap in your domain, metrics, and architecture while keeping the flow identical.