Technical Fundamentals for Non-Technical Product Managers

What's being tested

Interviewers are checking whether you, as a Product Manager, can translate vague operational or technical signals into a prioritized, measurable product response without doing engineering-level design. They want to see practical fluency with core tradeoffs — latency vs. throughput, consistency vs. freshness, monitoring and alerting signals (e.g., p99, p50, error rate), and how those translate into user-facing impact and roadmap decisions. DoorDash cares because PMs must scope fixes, set SLAs, decide experiments versus rollbacks, and communicate risk to stakeholders quickly and accurately.

Core knowledge

Latency: the time an operation takes to complete; monitor p50, p95, and p99 (the 50th, 95th, and 99th percentiles). Spikes at p99 often point to tail issues (resource contention, retries), while p50 shifts indicate systemic slowness.
Throughput: requests/sec or orders/sec. Capacity planning multiplies throughput by average latency to estimate concurrency needs, per Little's Law: $L=\lambda W$ (concurrent requests = arrival rate × average latency). For example, 200 requests/sec at 250 ms average latency implies about 50 requests in flight.
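Service Level Objective (SLO) / Service Level Agreement (SLA): an SLO is an internal, measurable target (e.g., 99% of checkouts complete in under 500 ms); an SLA is an external commitment, usually with consequences if it is missed. Convert SLO violations into business impact (e.g., conversion loss per minute).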
API vs. UI problems: distinguish client-side issues (network, SDK) from server-side problems by asking for client logs, synthetic monitoring, and server traces; prioritize fixes by user impact slices (new users, high-ARPU).
Idempotency & retries: for payment and order creation, require idempotency keys to prevent duplicates; without them, retries can create duplicate charges or orders, which is a data-integrity incident rather than a latency problem.
Caching & CDNs: use Redis or an edge CDN for read-heavy assets (menus, static images). As a rough rule of thumb, a cache hit rate above about 70% justifies the added complexity, though the real threshold depends on how expensive a miss is; invalidation adds product complexity (staleness).
Data stores and scale tradeoffs: Postgres provides strong consistency and complex queries, and comfortably serves tables with tens of millions of rows (often far more with good indexing); for very high write throughput or data that outgrows a single node, consider sharding or a write-optimized store. NoSQL stores typically trade complex queries for horizontal scale and faster writes.
Event/queue patterns: asynchronous processing (Kafka, task queues) decouples user-facing latency from background work but introduces eventual consistency and failure modes you must accept or mitigate in UX.
Observability: PMs should ask for the three pillars: metrics (counts/latencies), logs (errors), and traces (the distributed request path). Tag telemetry with feature-flag state and user cohort so rollouts can be measured.
Error budget decisioning: the error budget is the unreliability an SLO permits (a 99% SLO leaves a 1% budget). Map SLO breaches to action: small, short breach → throttling or rollback; chronic breach → invest in root-cause fixes. Quantify the cost of downtime against the engineering effort.
Consistency vs. freshness: for order tracking, choose strong consistency for payment status and eventual consistency for courier location; explicitly document user-visible expectations.
Security & compliance guardrails: PCI-sensitive flows must avoid client-side logging of payment details and use tokenization; treat these as non-negotiable constraints when scoping product fixes.
Instrumenting experiments: define guardrails (adequate statistical power, guardrail metrics such as payment failure rate) before running experiments that touch checkout or billing flows.
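
To make the latency-percentile and Little's Law items above concrete, here is a minimal Python sketch. The latency samples and the 200 requests/sec arrival rate are made-up illustrations, not real checkout data, and the nearest-rank method shown is only one of several ways monitoring tools compute percentiles.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def concurrent_requests(arrival_rate_per_s, avg_latency_s):
    """Little's Law, L = lambda * W: average number of requests in flight."""
    return arrival_rate_per_s * avg_latency_s


# Hypothetical checkout latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [120] * 90 + [300] * 8 + [2000] * 2

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
mean_ms = sum(latencies_ms) / len(latencies_ms)

print(p50, p95, p99)   # 120 300 2000
print(mean_ms)         # 172.0

# Hypothetical load: 200 requests/sec at 0.25 s average latency.
in_flight = concurrent_requests(200, 0.25)
print(in_flight)       # 50.0
```

In this made-up data the mean (172 ms) and p50 (120 ms) both look healthy while p99 is 2 seconds, which is why tail percentiles are tracked separately from averages.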

Tip: Always convert technical metrics into user/business impact (e.g., “2% increase in p99 checkout latency → estimated $X/day lost revenue”).
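
One way to do that conversion is a back-of-the-envelope calculation like the sketch below. Every input here (sessions per minute, conversion rates, average order value, incident duration) is a hypothetical placeholder; in practice the conversion drop would come from your own analytics, and the estimate is only as good as that measurement.

```python
def estimated_revenue_loss(sessions_per_min, baseline_conversion,
                           degraded_conversion, avg_order_value, minutes):
    """Rough revenue lost while conversion is degraded.

    lost orders = sessions/min * conversion drop * duration in minutes
    lost revenue = lost orders * average order value
    """
    conversion_drop = max(0.0, baseline_conversion - degraded_conversion)
    lost_orders = sessions_per_min * conversion_drop * minutes
    return lost_orders * avg_order_value


# Hypothetical incident: conversion falls from 30% to 28% for one hour.
loss = estimated_revenue_loss(
    sessions_per_min=1000,
    baseline_conversion=0.30,
    degraded_conversion=0.28,
    avg_order_value=30.0,
    minutes=60,
)
print(f"Estimated loss: ${loss:,.0f}")  # Estimated loss: $36,000
```

A figure like this is an order-of-magnitude estimate for prioritization, not an accounting number.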

Worked example — Investigate a `p99` latency spike for checkout

Frame: start by clarifying scope. Ask when the spike began, what percentage of users is affected, whether it spans all regions or a single cohort (new vs. returning), and whether there were recent deploys or config changes.

Skeleton of the response: (1) Triage: identify whether the spike is server-side, network, or client-side; (2) Isolate: correlate it with deploys, feature flags, traffic surges, and upstream dependency failures; (3) Hypothesize & mitigate: apply short-term mitigations (rollback, more replicas, temporary throttling), prioritized by user impact; (4) Diagnose & fix: run root-cause analysis with traces and logs, then prioritize permanent fixes on the roadmap.

Tradeoff: weigh a fast rollback, which restores the user experience quickly, against the cost of reverting important features; explicitly quantify the rollback's benefit against the engineering cost.

Close by stating next steps: add synthetic monitors for the failing flow, instrument finer-grained traces, run a post-incident RCA, and convert the findings into roadmap tickets and an SLO or SLA adjustment if needed. If more time: propose A/B experiments that compare mitigation strategies and measure conversion recovery.
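
The isolate step can be pictured as slicing the same latency data along candidate dimensions and seeing which one separates healthy traffic from slow traffic. The sketch below is illustrative only: the record fields (`region`, `build`, `latency_ms`) and the values are assumptions, and a real investigation would run this kind of breakdown in a dashboard or query tool rather than in a script.

```python
import math
from collections import defaultdict


def p99(values):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(99 * len(ordered) / 100))
    return ordered[rank - 1]


def p99_by(records, key):
    """Group request records by one dimension and compute p99 per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["latency_ms"])
    return {name: p99(values) for name, values in groups.items()}


def make(region, build, latency_ms, count):
    """Build `count` identical hypothetical request records."""
    return [{"region": region, "build": build, "latency_ms": latency_ms}] * count


# Hypothetical traffic: build v42 has a slow tail in both regions.
records = (
    make("us-west", "v41", 200, 40)
    + make("us-east", "v41", 210, 40)
    + make("us-west", "v42", 220, 9) + make("us-west", "v42", 1800, 1)
    + make("us-east", "v42", 230, 9) + make("us-east", "v42", 1800, 1)
)

by_region = p99_by(records, "region")
by_build = p99_by(records, "build")
print(by_region)  # {'us-west': 1800, 'us-east': 1800}
print(by_build)   # {'v41': 210, 'v42': 1800}
```

Here both regions look equally bad, so region does not explain the spike, while the build slice cleanly separates v41 from v42 and points toward the recent deploy as the first hypothesis to test.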

A second angle — Design a real-time order-tracking API

Framing differs: the problem is availability and freshness tradeoffs rather than pure latency. Start by clarifying UX: how fresh courier location must be, how long the client can tolerate being offline, and mobile battery constraints. Main pillars: choose push (server-sent events or WebSockets) vs. pull (polling) based on scale and network reliability; decide the consistency model (eventual vs. per-update acknowledgement); define the contract (endpoints, fields, error codes) and SLOs for update frequency (e.g., location updates every 10s, 99% delivered within 15s). Consider tradeoffs: push reduces client polling cost but increases server state and connection-management complexity; on unstable networks, use exponential backoff for retries and reconnects and accept best-effort delivery semantics. Close by planning telemetry: track delivery latency distribution, connection churn, and per-cohort accuracy to inform future performance investments.
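
For readers unfamiliar with exponential backoff, the sketch below shows the idea: each failed attempt doubles the wait before the next retry, up to a cap, and random jitter spreads clients out so they do not all reconnect at once. The 1-second base and 30-second cap are arbitrary assumptions chosen for illustration, not values from any particular client.

```python
import random


def backoff_delay(attempt, base_s=1.0, cap_s=30.0):
    """Un-jittered ceiling for a retry: base * 2^attempt, capped at cap_s."""
    return min(cap_s, base_s * 2 ** attempt)


def jittered_delay(attempt, rng, base_s=1.0, cap_s=30.0):
    """'Full jitter' variant: wait a random time between 0 and the ceiling."""
    return rng.uniform(0.0, backoff_delay(attempt, base_s, cap_s))


# Ceilings for the first seven consecutive failures.
schedule = [backoff_delay(attempt) for attempt in range(7)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]

# Jittered waits vary per client; a seeded generator keeps this run repeatable.
rng = random.Random(42)
waits = [jittered_delay(attempt, rng) for attempt in range(7)]
```

The product-level consequence is that a client on a bad connection may go up to the cap (here 30 seconds) between attempts, which is worth reconciling with any freshness SLO you set.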

Common pitfalls

Pitfall: Treating latency metrics in isolation.

PMs often focus on average latency improvements without linking them to conversion or error-rate changes, and averages can hide tail (p99) regressions. Better: always map metric deltas to user outcomes and revenue impact before prioritizing work.

Pitfall: Demanding a single “fix this now” without scope.

Engineers and SREs need a prioritized plan. Don’t ask for a full design; request short-term mitigations, impact estimates, and a timeline for a permanent fix.

Pitfall: Over-specifying implementation.

Avoid prescribing specific technologies (e.g., “use Kafka partitioning here”) unless you have strong evidence. State constraints, acceptance criteria, and expected user outcomes; let engineering select the best implementation.

Connections

Operational troubleshooting questions often lead to adjacent pivots: experiment design (if you propose testing a mitigation with an A/B experiment), data instrumentation/analytics (to quantify user impact), and reliability/incident management (SRE processes and runbooks). Be ready to move between product prioritization and measurable engineering outcomes.
