Technical Fundamentals for Non-Technical Product Managers
Asked of: Product Manager

What's being tested
Interviewers are checking whether you, as a Product Manager, can translate vague operational or technical signals into a prioritized, measurable product response without doing engineering-level design. They want to see practical fluency with core tradeoffs — latency vs. throughput, consistency vs. freshness, monitoring and alerting signals (e.g., p99, p50, error rate), and how those translate into user-facing impact and roadmap decisions. DoorDash cares because PMs must scope fixes, set SLAs, decide experiments versus rollbacks, and communicate risk to stakeholders quickly and accurately.
Core knowledge
- Latency: the time an operation takes. Monitor p50, p95, and p99. Spikes at p99 often point to tail issues (resource contention, retries), while shifts in p50 indicate systemic slowness.
- Throughput: requests/sec or orders/sec. Capacity planning uses Little's Law to estimate concurrency needs: concurrent requests = arrival rate × average latency.
- Service Level Objective (SLO) / Service Level Agreement (SLA): an SLO is an internal reliability target; an SLA is a commitment made to customers, usually with consequences if it is missed. Define measurable targets (e.g., 99% of checkouts < 500ms) and convert SLO violations to business impact (conversion loss per minute).
- API vs. UI problems: distinguish client-side issues (network, SDK) from server-side problems by asking for client logs, synthetic monitoring, and server traces; prioritize fixes by user impact slices (new users, high-ARPU).
- Idempotency & retries: for payment and order creation, require idempotency keys so that a retried request cannot create a duplicate charge or order. Missing idempotency shows up as a data integrity incident rather than a latency problem.
- Caching & CDNs: use Redis or an edge CDN for read-heavy assets (menus, static images). As a rough rule of thumb, a cache hit rate above 70% justifies the added complexity; invalidation adds product complexity because users may see stale data.
- Data stores and scale tradeoffs: Postgres provides strong consistency and complex queries, and handles tens of millions of rows comfortably (more with careful indexing and partitioning); for massive scale or very high write throughput, consider sharding or a write-optimized store. NoSQL stores typically trade complex queries for horizontal scale and faster writes.
- Event/queue patterns: asynchronous processing (Kafka, task queues) decouples user-facing latency from background work but introduces eventual consistency and failure modes you must accept or mitigate in UX.
- Observability: PMs should ask for the three pillars: metrics (counts/latencies), logs (errors), and traces (distributed request path). Instrument feature flags and user cohorts to measure rollouts.
- Error budget decisioning: the error budget is the amount of unreliability the SLO allows (1 minus the SLO target). Map SLO breaches to action: small, short breach → throttling or rollback; chronic breach → invest in root-cause fixes. Quantify cost of downtime vs. engineering effort.
- Consistency vs. freshness: for order tracking, choose strong consistency for payment status and eventual consistency for courier location; explicitly document user-visible expectations.
- Security & compliance guardrails: PCI-sensitive flows must avoid client-side logging of payment details and use tokenization; treat these as non-negotiable constraints when scoping product fixes.
- Instrumenting experiments: put safeguards in place (adequate statistical power, guardrail metrics such as payment failure rate) before running experiments that touch checkout or billing flows.
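To make the latency and throughput items above concrete, here is a minimal Python sketch of percentile reading and Little's Law. The sample latencies, the 200 requests/sec arrival rate, and the helper names `percentile` and `required_concurrency` are all hypothetical, chosen only for illustration; real monitoring tools may use a different percentile method.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def required_concurrency(arrival_rate_per_sec, avg_latency_sec):
    """Little's Law: concurrent requests = arrival rate x average latency."""
    return arrival_rate_per_sec * avg_latency_sec


# Hypothetical checkout latencies in ms: 98 fast requests and 2 very slow ones.
latencies_ms = [120] * 98 + [2500] * 2

p50 = percentile(latencies_ms, 50)       # typical request
p99 = percentile(latencies_ms, 99)       # tail request
mean_ms = statistics.mean(latencies_ms)  # the average hides how bad the tail is

# Hypothetical load: 200 requests/sec at 0.25 s average latency.
in_flight = required_concurrency(200, 0.25)

print(f"p50={p50}ms p99={p99}ms mean={mean_ms:.1f}ms in_flight={in_flight}")
```

In this made-up data the median stays at the fast value while p99 lands on the slow requests, which is why a tail problem can be invisible in a dashboard that only shows averages.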
Tip: Always convert technical metrics into user/business impact (e.g., “2% increase in p99 checkout latency → estimated $X/day lost revenue”).
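The SLO, error budget, and business-impact points above can be sketched as simple arithmetic. This is a rough illustration under stated assumptions: the request counts, the 2% conversion drop, the order rate, and the average order value are hypothetical inputs, and `budget_consumed` and `revenue_at_risk` are illustrative helper names, not a standard API.

```python
def error_budget(slo_target, total_requests):
    """Number of requests allowed to miss the SLO in the measurement window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.99")
    return (1.0 - slo_target) * total_requests


def budget_consumed(slo_target, total_requests, bad_requests):
    """Fraction of the error budget already used (1.0 means fully spent)."""
    return bad_requests / error_budget(slo_target, total_requests)


def revenue_at_risk(orders_per_minute, conversion_drop, avg_order_value, minutes):
    """Rough estimate: orders lost to the degradation times average order value."""
    return orders_per_minute * conversion_drop * avg_order_value * minutes


# Hypothetical window: a 99% SLO, 1,000,000 checkouts, 4,000 of them too slow.
used = budget_consumed(0.99, 1_000_000, 4_000)

# Hypothetical incident: 500 orders/min, 2% conversion drop, $30 orders, 60 minutes.
loss = revenue_at_risk(500, 0.02, 30.0, 60)

print(f"error budget used: {used:.0%}, estimated revenue at risk: ${loss:,.0f}")
```

A budget that is mostly unspent supports a quick mitigation and moving on; a budget that keeps running out is the signal to fund root-cause work.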
Worked example — Investigate a p99 latency spike for checkout
Frame: start by clarifying scope. Ask when the spike began, what percentage of users is affected, whether it spans all regions or only a cohort (new vs. returning), and whether there were recent deploys or config changes.
Skeleton of the response:
(1) Triage: identify whether the spike is server-side, network, or client.
(2) Isolate: correlate with deploys, feature flags, traffic surges, and upstream dependency failures.
(3) Hypothesize & mitigate: apply short-term mitigations (rollback, more replicas, temporary throttling), prioritized by user impact.
(4) Diagnose & fix: run root-cause analysis with traces and logs, then prioritize permanent fixes on the roadmap.
Tradeoff: weigh a fast rollback (low user friction) against the risk of reverting important features; explicitly quantify rollback benefit vs. engineering cost.
Close by stating next steps: add synthetic monitors for the failing flow, instrument finer-grained traces, run a post-incident RCA, and convert findings into roadmap tickets and an SLO/SLA adjustment if needed.
If more time: propose A/B experiments to test different mitigation strategies and measure conversion recovery.
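The "isolate" step above often comes down to slicing the same latency metric by cohort. The sketch below assumes request records are available as (cohort, latency) pairs; the region names, latencies, and the 500 ms threshold are hypothetical, and in practice this slicing would usually happen in a dashboard or query tool rather than in hand-written code.

```python
import math
from collections import defaultdict


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(99 * len(ordered) / 100))
    return ordered[rank - 1]


def p99_by_cohort(requests):
    """Group (cohort, latency_ms) pairs by cohort and compute p99 for each."""
    grouped = defaultdict(list)
    for cohort, latency_ms in requests:
        grouped[cohort].append(latency_ms)
    return {cohort: p99(values) for cohort, values in grouped.items()}


def cohorts_over_threshold(requests, threshold_ms):
    """Return the cohorts whose p99 exceeds the threshold, sorted by name."""
    return sorted(
        cohort
        for cohort, value in p99_by_cohort(requests).items()
        if value > threshold_ms
    )


# Hypothetical data: one region has a slow tail, the other looks healthy.
requests = (
    [("us-west", 180)] * 95
    + [("us-west", 3000)] * 5
    + [("us-east", 200)] * 100
)

print(p99_by_cohort(requests))
print(cohorts_over_threshold(requests, threshold_ms=500))
```

If only one cohort crosses the threshold, the spike is likely tied to something specific to that slice (a regional dependency, a flag rollout) rather than to the checkout service as a whole.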
A second angle — Design a real-time order-tracking API
Framing differs: the problem is availability and freshness tradeoffs rather than pure latency. Start by clarifying UX: how fresh courier location must be, how much offline time is tolerable, and what mobile battery constraints apply. Main pillars: choose push (Server-Sent Events or WebSockets) vs. pull (polling) based on scale and network reliability; decide on a consistency model (eventual vs. per-update acknowledgement); define the contract (endpoints, fields, error codes) and SLOs for update frequency (e.g., location updates every 10s, 99% delivered within 15s). Consider tradeoffs: push reduces client polling cost but increases server state and connection management complexity; on unstable networks, use exponential backoff for reconnects and accept best-effort delivery semantics. Close by planning telemetry: track delivery latency distribution, connection churn, and per-cohort accuracy to inform future performance investments.
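The exponential backoff mentioned above can be sketched in a few lines. This is one common variant (doubling delay with a cap, plus optional random jitter), not a prescribed design; the 1-second base, 30-second cap, and the name `backoff_delay` are assumptions for illustration.

```python
import random


def backoff_delay(attempt, base_sec=1.0, cap_sec=30.0, rng=None):
    """Seconds to wait before reconnect attempt `attempt` (0-based).

    The delay doubles with each attempt up to cap_sec. If a random number
    generator is passed, a random delay between 0 and that ceiling is
    returned ("full jitter") so many clients do not reconnect at once.
    """
    ceiling = min(cap_sec, base_sec * (2 ** attempt))
    if rng is None:
        return ceiling
    return rng.uniform(0, ceiling)


# Deterministic schedule without jitter.
delays = [backoff_delay(n) for n in range(7)]
print(delays)

# Jittered schedule; a seeded generator keeps the example repeatable.
rng = random.Random(0)
jittered = [backoff_delay(n, rng=rng) for n in range(7)]
print([round(d, 2) for d in jittered])
```

For a PM, the relevant product consequence is the cap: with these assumed values a client on a bad network may go up to 30 seconds between reconnect attempts, which should be consistent with the freshness SLO and with what the UI shows while location data is stale.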
Common pitfalls
Pitfall: Treating latency metrics in isolation.
PMs often focus on average latency improvements without linking to conversion or error-rate changes. Better: always map metric deltas to user outcomes and revenue impact before prioritizing work.
Pitfall: Demanding a single “fix this now” without scope.
Engineers and SREs need a prioritized plan. Don’t ask for a full design; request short-term mitigations, impact estimates, and a timeline for a permanent fix.
Pitfall: Over-specifying implementation.
Avoid prescribing specific technologies (e.g., “use Kafka sharding here”) unless you have strong evidence. State constraints, acceptance criteria, and expected user outcomes; let engineering select the best implementation.
Connections
Operational troubleshooting questions often lead to adjacent pivots: experiment design (if you propose an A/B rollback), data instrumentation/analytics (to quantify user impact), and reliability/incident management (SRE processes and runbooks). Be ready to move between product prioritization and measurable engineering outcomes.
Further reading
- [Designing Data-Intensive Applications, by Martin Kleppmann]: conceptual tradeoffs between consistency, durability, and scalability; excellent for PM-level system tradeoff language.
- [Site Reliability Engineering, by Google]: practical frameworks for SLOs, error budgets, and incident response that PMs should quote when discussing reliability.