TikTok Software Engineer Interview Prep Guide
Everything TikTok actually asks Software Engineer candidates — concept walkthroughs, worked examples, and the real interview questions, drawn from candidate reports. Free to read.
Last updated

Technical Screen
Coding & Algorithms
-
Binary Tree Algorithms — covered in depth under Take-home Project below.
-
Graph, Grid, And Connectivity Algorithms — covered in depth under Onsite below.

What's being tested
Tests string segmentation with variable-length metadata, where each chunk must fit limit including a suffix like <i/n>. The key skill is deriving the minimum feasible page count, then slicing the original message without reordering or dropping characters.
Patterns & templates
-
Feasibility check for a candidate
n: total payload capacity is ; require capacity>= len(message). -
Digit-length bucketing avoids repeated string conversion:
len(str(i))is constant over ranges[1,9],[10,99], etc. -
Linear search over page count is often acceptable when
message.lengthis moderate; otherwise optimize with digit buckets, not naive per-page recomputation. -
Impossible-case guard: if any suffix length
len("<i/n>") >= limit, that page has no payload capacity, so candidatenis invalid. -
Two-pass construction: first find minimal feasible
n, then build chunks by takinglimit - suffix.lengthcharacters and appending suffix. -
Complexity target:
O(L + n)time andO(L + n log n)output space, whereL = message.length; avoid quadratic string concatenation. -
Use
StringBuilder, list append, or substring indices; repeatedresult += chunkcan degrade toO(L^2)in many languages.
Common pitfalls
Pitfall: Treating suffix length as constant;
<9/10>and<10/10>have different widths.
Pitfall: Choosing
n = ceil(len(message) / limit)before accounting for suffix overhead, which underestimates page count.
Pitfall: Returning non-minimal valid segmentation when the problem asks for the smallest number of parts.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions

What's being tested
Match-3 board simulation tests grid scanning, repeated state mutation, and termination when no more eliminations exist. Interviewers are probing whether you can separate detection from deletion, apply gravity correctly, and reason about complexity over multiple cascades.
Patterns & templates
-
Two-phase update: first mark all crushable cells, then mutate; deleting during scan causes missed overlapping horizontal/vertical matches.
-
Row/column run detection: scan consecutive equal nonzero values with
whileloops; mark runs wherelength >= 3. -
Marker encoding: use negative values or a separate
to_crushboolean grid; preserve original candy type while detecting overlapping groups. -
Gravity compaction: for each column, write non-crushed values from bottom to top using
write_row, then fill remaining cells with0. -
Fixed-point iteration: repeat
detect -> crush -> gravityuntil no cells are marked; worst-case time isO(kmn)forkcascades. -
Connected group variant: if removal is by connected components instead of straight runs, use BFS/DFS with
visitedand size threshold checks. -
Clean function split: implement
find_matches(board),apply_gravity(board), andresolve(board)to simplify debugging and interviewer follow-ups.
Common pitfalls
Pitfall: Mutating cells to
0during detection can break detection of crossing matches, such as a horizontal and vertical run sharing one cell.
Pitfall: Gravity must preserve vertical order within each column; sorting or rebuilding incorrectly changes relative candy positions.
Pitfall: Forgetting the loop termination condition leads to returning after one crush instead of resolving all cascades.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions
- Array Search, Selection, And Dynamic Programming — covered in depth under Take-home Project below.

What's being tested
You’re being tested on mutable data structure invariants: how to update pointers, indices, maps, queues, and tree links without breaking correctness. The strongest answers combine the right structure pairings—hash map + list, array + index set, deque + window—with clear O(1) or O(n) complexity arguments.
Patterns & templates
-
Hash map + doubly linked list for
LRUCache—get/putinO(1); always move touched nodes to head. -
Array + hash map of index sets for randomized multiset —
insert,remove,getRandomaverageO(1); swap-delete from array tail. -
Monotonic deque for sliding maximum — keep indices in decreasing value order; pop expired indices before reading window max.
-
Pointer rewiring for linked-list deletion — use dummy head, track
prev/curr; handle deleting head, tail, and repeated values. -
Binary tree deletion via replacement — for BSTs, replace with inorder successor/predecessor; for general trees, define deletion semantics first.
-
Shape normalization for distinct islands — DFS/BFS record relative offsets or path signatures; avoid absolute coordinates and traversal-order ambiguity.
-
Frequency aggregation with heaps or maps — use
Counter/hash map first, then heap/bucket depending on required ordering and complexity.
Common pitfalls
Pitfall: Claiming
O(1)deletion from an array-backed randomized set without explaining the swap-with-last index update.
Pitfall: Forgetting stale indices in a sliding-window deque, causing maxima from outside the current window.
Pitfall: Mutating linked lists or trees without covering root/head deletion, single-node structures, duplicates, and null children.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions
Software Engineering Fundamentals

What's being tested
Interviewers are probing whether you understand Java framework internals, concurrency correctness, and design tradeoffs beyond surface-level API usage. For TikTok-scale services, a Software Engineer must reason about how Spring creates objects, how proxies intercept calls, how transactions behave under failure, and how thread pools or locks affect latency and correctness. The expected signal is not memorization; it is the ability to explain mechanisms, identify edge cases, and make safe implementation choices under high concurrency.
Core knowledge
-
Dependency injection separates object construction from business logic: classes declare dependencies through constructors, setters, or fields, and a container such as
Springwires concrete implementations. Constructor injection is usually preferred because it supports immutability, easier testing, and fail-fast detection of missing dependencies. -
Bean lifecycle in
Springtypically flows through classpath scanning or explicit configuration, bean definition creation, instantiation, dependency injection, initialization callbacks such as@PostConstruct, proxy wrapping, and destruction callbacks. A common edge case: the object you call may be a proxy, not the raw class instance. -
Aspect-oriented programming applies cross-cutting behavior such as logging, authorization, metrics, retries, or transactions around method calls.
Spring AOPcommonly uses JDK dynamic proxies for interfaces andCGLIBsubclass proxies for concrete classes; limitations include final classes, final methods, and internal self-invocation bypassing the proxy. -
Transaction management in
Springusually works by intercepting a method annotated with@Transactional, opening or joining a transaction, committing on success, and rolling back on configured exceptions. By default, rollback happens for unchecked exceptions, while checked exceptions require explicitrollbackFor. -
Transaction propagation controls how nested calls behave:
REQUIREDjoins an existing transaction or creates one,REQUIRES_NEWsuspends the current transaction, andNESTEDuses savepoints if supported. Interviewers often test whether you know that propagation only matters when calls pass through the transaction proxy. -
Isolation levels define what anomalies are allowed:
READ_COMMITTEDprevents dirty reads,REPEATABLE_READprevents non-repeatable reads in many databases, andSERIALIZABLEgives strongest correctness with lower concurrency. You should connect this to real bugs like double-spending, duplicate order creation, or stale inventory reads. -
Java Memory Model defines visibility and ordering guarantees between threads.
volatileprovides visibility and ordering for a single variable,synchronizedprovides mutual exclusion plus happens-before guarantees, andAtomicInteger/Compare-And-Swapsupport lock-free updates for simple state transitions. -
Thread pools avoid creating one thread per task and provide bounded concurrency. For CPU-bound work, start near number of cores; for I/O-bound work, more threads may be useful. Little’s Law, , helps reason about concurrency: in-flight work equals arrival rate times average latency.
-
Locking primitives solve different problems:
synchronizedandReentrantLockprovide mutual exclusion,Semaphorelimits concurrent access,ReadWriteLockallows multiple readers but exclusive writers, andConditionsupports wait/notify-style coordination. Always pair condition checks withwhile, notif, because of spurious wakeups. -
Deadlock requires four conditions: mutual exclusion, hold-and-wait, no preemption, and circular wait. Practical prevention includes global lock ordering, timeouts via
tryLock, minimizing lock scope, and avoiding blocking I/O while holding a lock. -
Design patterns are communication tools, not decorations. Singleton needs thread-safe initialization, often via enum or static holder; Factory Method hides object creation; Strategy replaces condition-heavy branching with interchangeable behavior. Overusing patterns can increase indirection and reduce readability.
-
Python GIL is relevant as a comparison point:
CPythonallows only one thread to execute Python bytecode at a time, so threads help I/O-bound tasks more than CPU-bound tasks. Java has true parallel threads, so data races and memory visibility issues become more central.
Worked example
For “Explain Java DI, AOP, and transactions”, start by framing the answer as “how Spring turns plain classes into managed runtime behavior.” In the first 30 seconds, clarify whether the interviewer wants conceptual explanation, implementation internals, or production pitfalls; then state assumptions such as “I’ll describe typical Spring Boot behavior using annotations and proxy-based AOP.” Organize the answer into four pillars: dependency injection, bean lifecycle, proxy/AOP mechanism, and transaction boundaries.
For dependency injection, explain that the container builds a graph of beans and injects dependencies, preferably through constructors. For AOP, explain that annotations such as @Transactional do not magically alter bytecode in normal Spring AOP; a proxy intercepts external method calls and wraps behavior around them. For transactions, describe how the proxy opens or joins a transaction, delegates to the target method, then commits or rolls back depending on the outcome. A strong tradeoff to flag is proxy-based simplicity versus limitations: self-invocation like this.updateUser() bypasses the proxy, so the transactional advice may not run. Close by saying that, with more time, you would cover propagation, isolation levels, rollback rules, and how to test transactional behavior using integration tests against a real database.
A second angle
For “Explain multithreading and locks”, the same depth standard applies, but the framing shifts from framework behavior to correctness under interleavings. Instead of explaining how a container intercepts method calls, focus on shared state, visibility, atomicity, and progress. A good answer names primitives such as synchronized, ReentrantLock, Semaphore, ReadWriteLock, and Condition, then explains when each is appropriate. The candidate should also show implementation instincts: a bounded blocking queue needs a lock, two conditions such as notFull and notEmpty, and while loops around condition waits. The transfer is that both topics require reasoning about hidden runtime behavior: proxies in framework code, scheduler interleavings in concurrent code.
Common pitfalls
Pitfall: Saying “
@Transactionalstarts a transaction whenever the method is called.”
That is only true when the call goes through the managed proxy. A better answer mentions proxy interception, self-invocation failure, method visibility constraints, propagation behavior, and rollback defaults.
Pitfall: Treating
volatileas a general replacement for locks.
volatile gives visibility and ordering for reads/writes of that variable, but it does not make compound actions like count++ atomic. For increments, use AtomicInteger, LongAdder under high contention, or a lock if multiple fields must change consistently.
Pitfall: Listing design patterns without explaining tradeoffs.
“Use Singleton, Factory, and Strategy” sounds memorized if you do not discuss coupling, testability, lifecycle, and thread safety. A stronger answer says when a pattern helps, when dependency injection is cleaner, and what concurrency bugs a naive implementation could introduce.
Connections
Interviewers may pivot from this area into JVM internals, including garbage collection, class loading, and JIT compilation; distributed systems, including idempotency, retries, and consistency; or database correctness, including isolation levels, indexes, and locking. They may also ask you to implement a small concurrent class or refactor code to use Strategy, Factory, or dependency injection cleanly.
Further reading
-
Java Concurrency in Practice — the classic practical reference for Java thread safety, memory visibility, locks, and executor design.
-
Spring Framework Reference Documentation — authoritative source for
Springbean lifecycle, AOP proxying, and transaction semantics. -
Design Patterns: Elements of Reusable Object-Oriented Software by Gamma, Helm, Johnson, and Vlissides — canonical treatment of Singleton, Factory Method, Strategy, and their tradeoffs.
Practice questions
System Design
- Scalable Distributed System Architecture — covered in depth under Onsite below.
What's being tested
Interviewers are probing whether you can operate production services under real traffic: define service-level indicators, diagnose latency or failure spikes, design overload protection, and make tradeoffs between reliability, cost, and feature completeness. For TikTok-scale systems, a small inefficiency or missing guardrail can cascade across millions of requests, so a Software Engineer is expected to reason beyond “my code works” into capacity, observability, dependencies, and graceful degradation. Strong answers show a loop: measure with the right signals, form hypotheses, isolate bottlenecks, mitigate safely, and verify impact. You should be ready to discuss concrete mechanisms like `p99` latency, `Kubernetes` pod failures, `Redis` eviction policies, circuit breakers, backpressure, and SLO-based prioritization.
Core knowledge
-
SLI/SLO/SLA are distinct: an SLI is a measured signal like
`availability`,`error_rate`, or`p99_latency`; an SLO is the internal target, such as`99.9%`successful requests under`300ms`; an SLA is the external contract with penalties. -
Latency percentiles matter more than averages for user-facing systems. Track
`p50`,`p95`,`p99`, and sometimes`p999`; mean latency can hide tail spikes caused by lock contention, GC pauses, noisy neighbors, network retries, or overloaded downstream services. -
Little’s Law helps reason about queues: , where is in-flight work, is arrival rate, and is average time in system. If arrival rate exceeds service capacity, queue length grows unbounded and tail latency explodes.
-
Observability should include metrics, logs, and traces. Use RED metrics for request services: Rate, Errors, Duration. Use USE metrics for resources: Utilization, Saturation, Errors across CPU, memory, disk, network, thread pools, and connection pools.
-
Performance diagnosis should move layer by layer: client latency, edge/load balancer, application handler, downstream RPCs, database queries, cache hit rate, host resources, and deployment changes. A strong answer avoids guessing and correlates symptoms with deploys, traffic shape, dependency health, and saturation.
-
Load shedding protects the system by rejecting or degrading work before collapse. Common tools include token-bucket rate limiting, bounded queues, priority classes, request deadlines, adaptive concurrency limits, circuit breakers, and returning
`429`,`503`, cached, or partial responses. -
Backpressure pushes overload signals upstream instead of silently accumulating work. Examples include bounded worker queues, non-blocking rejection,
`gRPC`deadlines, client retry budgets, and avoiding infinite retries that amplify load during an incident. -
Circuit breakers isolate failing dependencies. A typical state machine has closed, open, and half-open states; open after error-rate or timeout thresholds, periodically probe in half-open, and close only after sustained recovery. Pair with fallbacks and timeouts.
-
Kubernetesfailure debugging starts with pod state and events:`CrashLoopBackOff`,`OOMKilled`,`ImagePullBackOff`, readiness probe failures, CPU throttling, node pressure, misconfigured resource requests/limits, bad config maps, or dependency startup ordering. Check`kubectl describe pod`, logs, events, and rollout history. -
Capacity planning should estimate peak QPS, per-request CPU/memory, dependency limits, and headroom. If one instance handles
`500`QPS at target`p99`, and peak is`20k`QPS, you need at least`40`instances before redundancy, zone failure tolerance, and deploy surge capacity. -
Redistradeoffs include persistence, memory policy, replication, and clustering.`RDB`snapshots are compact but can lose recent writes;`AOF`improves durability with write amplification; eviction policies like`allkeys-lru`,`volatile-ttl`, or`noeviction`must match whether stale or missing cache entries are acceptable. -
Caching failure modes include stampedes, hot keys, stale data, and cache/database inconsistency. Use TTL jitter, request coalescing, negative caching, hot-key sharding, single-flight locks, and clear ownership of whether
`Redis`is a cache, primary store, queue, or coordination primitive.
Worked example
For “Design overload protection with load shedding”, start by clarifying the service shape: is it a stateless HTTP API, what is peak QPS, which requests are user-critical, what downstream dependencies exist, and what SLO must be preserved under overload? Then declare an assumption such as: “I’ll design for a user-facing read-heavy service where protecting `p99` latency and availability is more important than serving every low-priority request.” A strong answer can be organized around four pillars: admission control at the edge, bounded work inside the service, dependency isolation, and graceful degradation.
At the edge, propose per-user or per-token rate limits using token buckets, plus global adaptive limits when fleet saturation crosses thresholds. Inside the service, use bounded queues, request deadlines, worker-pool limits, and fast rejection instead of letting memory or threads grow until the process dies. For dependencies, add circuit breakers, bulkheads, timeout budgets, and fallback paths such as cached responses or reduced payloads. Explicitly flag the tradeoff: aggressive shedding protects the majority of users and keeps recovery fast, but it can reject legitimate burst traffic, so limits should be observable, configurable, and tested under load. Close by saying that with more time you would add chaos/load tests, retry-budget enforcement on clients, and dashboards showing accepted QPS, shed QPS, saturation, and user-visible error rates.
A second angle
For “Explain Redis design, persistence, and scaling”, the same reliability thinking applies but the bottleneck is now stateful infrastructure rather than request admission. You still need to ask whether `Redis` is used as a cache, session store, rate limiter, lock service, or primary-ish data store, because each use case changes durability and consistency expectations. For a cache, `allkeys-lru`, TTL jitter, and read-through fallback may be fine; for rate limiting, atomicity via `INCR` plus expiry or Lua scripting matters more. Scaling requires discussing memory limits, hot keys, replication lag, failover behavior, and cluster slot distribution. The key transfer is that reliability is not “make `Redis` fast”; it is choosing failure behavior intentionally when memory fills, a node dies, or traffic concentrates on one key.
Common pitfalls
Pitfall: Treating reliability as just “add more machines.”
Horizontal scaling helps only if the bottleneck is stateless compute. If the real issue is a saturated database connection pool, hot `Redis` key, synchronized retries, slow external RPC, or unbounded queue, adding pods may amplify pressure downstream and worsen tail latency.
Pitfall: Jumping to fixes before defining symptoms and success metrics.
A weak answer says “I would cache it” or “I would optimize the query” without naming the SLI being improved. A better answer says: “The regression is `p99` latency from `220ms` to `900ms`, correlated with a deploy and increased DB time; I’ll roll back or mitigate first, then profile and validate against the SLO.”
Pitfall: Staying too shallow on operational mechanisms.
Saying “use monitoring, rate limiting, and `Kubernetes`” is not enough. Interviewers expect you to explain what you would monitor, where you would enforce limits, what happens to rejected requests, how readiness/liveness probes differ, and how the system recovers without causing retry storms.
Connections
Interviewers may pivot from here into distributed systems consistency, database indexing and query optimization, microservice API design, or incident response and postmortems. They may also connect overload protection to rate limiter design, `Redis` atomic operations, `Kafka` consumer lag, or `Kubernetes` autoscaling behavior.
Further reading
-
Google SRE Book — canonical reference for SLOs, error budgets, toil, incident response, and production reliability practices.
-
The Tail at Scale, Dean and Barroso — foundational paper on why tail latency dominates large-scale user-facing systems.
-
Redis Persistence Documentation — practical details on
`RDB`,`AOF`, durability tradeoffs, and operational implications.
Practice questions
Behavioral & Leadership
What's being tested
Interviewers are probing ownership: whether you can take a software project from ambiguous problem to reliable delivery, measurable impact, and thoughtful follow-up. For a Software Engineer at TikTok scale, this means connecting engineering decisions to concrete outcomes like latency, crash rate, feed load success, CTR, watch time, or creator workflow completion without pretending to be the PM or Data Scientist. You are expected to reason about metrics, tradeoffs, instrumentation, debugging, collaboration, and risk management in a technically credible way. Strong answers show that you did not just “ship code”; you defined success, made constraints explicit, handled ambiguity, and learned from the result.
Core knowledge
-
STAR is the baseline structure for behavioral answers: Situation, Task, Action, Result. For senior-quality answers, add tradeoff, metric, and reflection so the interviewer hears judgment, not just chronology.
-
Ownership scope for a Software Engineer includes clarifying requirements, proposing technical options, identifying risks, implementing and reviewing code, monitoring rollout, and driving follow-ups. It does not require inventing product strategy, but it does require asking, “What user or system behavior should improve?”
-
Outcome metrics capture the main goal:
p95 feed load latency,video publish success rate,message delivery success,crash-free sessions, orcreator upload completion. Pick one primary metric when possible; too many “primary” metrics make the project look unfocused. -
Secondary metrics explain mechanism: cache hit rate, queue depth, retry count, database query count, API error distribution, client render time, or upload chunk failure rate. They help prove why the outcome moved and are often more actionable for engineers.
-
Guardrail metrics prevent harmful wins:
p99 latency, memory usage, CPU utilization, battery drain, bandwidth, error rate, abuse reports, accessibility regressions, or rollback frequency. A good answer says, “We optimized X, but watched Y to ensure we did not degrade reliability.” -
Instrumentation should be designed before launch. Define event names, required fields, correlation IDs, sampling policy, and success/failure semantics. For example, a video upload flow may log
upload_started,chunk_retry,transcode_completed, andpublish_succeededwith a sharedrequest_id. -
Reliability metrics are often clearer than vague “quality.” Use service-level indicators such as availability, latency, and correctness. Availability can be expressed as and tied to an SLO like
99.9%successful publishes. -
Latency metrics should use percentiles, not averages.
p50shows typical experience, butp95andp99reveal tail problems that matter at TikTok scale. A mean latency improvement can hide worse outliers if a dependency or cache path regresses. -
Rollout strategy is part of ownership. Mention feature flags, canary release, staged percentage rollout, dark launch, rollback plan, and dashboards. For risky backend changes, start with internal traffic, then
1%,5%,25%, and full rollout after guardrails remain stable. -
Debugging under ambiguity should move from broad to narrow: reproduce, inspect logs and traces, compare cohorts or versions, isolate recent changes, form hypotheses, test one variable at a time, and document the root cause. Avoid jumping straight to a favorite explanation.
-
Tradeoff reasoning should be explicit: latency vs correctness, consistency vs availability, simplicity vs extensibility, storage cost vs query speed, and short-term patch vs long-term architecture. The interviewer wants to hear why your chosen path was reasonable under constraints.
-
Impact should be quantified whenever possible: “reduced
p95latency from850 msto420 ms,” “cut retry storms by70%,” “improved upload success by2.3 percentage points,” or “reduced on-call pages from12/weekto2/week.”
Worked example
For “Define and measure project metrics,” a strong candidate should start by clarifying the project goal in the first 30 seconds: “Are we optimizing user-perceived performance, reliability, engagement, or engineering efficiency? What surface is affected, and what is the rollout scope?” Then declare assumptions, such as: “Suppose this is a backend change to reduce video publish failures for creators.” The answer can be organized into four pillars: primary outcome metric, diagnostic secondary metrics, guardrails, and measurement plan. The primary metric might be publish_success_rate, defined as completed publishes divided by valid publish attempts, excluding user cancellations. Secondary metrics could include upload_retry_count, transcode failure rate, API timeout rate, and dependency latency. Guardrails would include p99 publish latency, storage cost, CPU usage, and error rate for unrelated publish flows. A tradeoff to flag explicitly is that aggressive retries may improve success rate but increase backend load and user wait time, so retries need capped exponential backoff and monitoring. The measurement plan should include pre-launch baseline, dashboard ownership, staged rollout, alert thresholds, and a rollback condition like “rollback if p99 latency increases by more than 20% for 30 minutes.” Close by saying: “If I had more time, I would validate whether failures are concentrated by app version, region, network type, or media size so we can target the next fix instead of overgeneralizing.”
A second angle
For “Describe a project you are proud of,” the same ownership pattern applies, but the framing is more narrative than metric-design focused. Choose a project where you can explain the technical challenge, your specific contribution, and measurable result without sounding like the whole team’s work was yours alone. A strong answer might cover a migration from synchronous processing to an asynchronous queue-backed workflow, emphasizing why the old design failed under traffic spikes, how you evaluated alternatives, and how you reduced user-facing timeouts. The metrics still matter, but they appear as evidence: p95 latency dropped, error rate improved, operational load decreased, or deployment frequency increased. The close should include what you learned and what you would improve, such as better load testing, earlier stakeholder alignment, or more complete observability before launch.
Common pitfalls
Pitfall: Giving a product-only answer with no engineering substance.
A weak answer says, “We wanted to increase engagement, so I worked with PM and launched a feature that users liked.” That may be fine for a PM interview, but a Software Engineer should explain the technical constraints, implementation choices, reliability risks, rollout plan, and how the system behaved after launch.
Pitfall: Reporting metrics without definitions.
Saying “latency improved by 40%” is incomplete if you do not specify p50, p95, client-side vs server-side, measurement window, traffic segment, and whether the comparison was before/after or controlled rollout. A better answer defines the metric precisely and acknowledges caveats: “This was server-side p95 over seven days of comparable traffic.”
Pitfall: Using STAR mechanically and hiding judgment.
Many candidates recite Situation, Task, Action, Result but skip conflict, uncertainty, and tradeoffs. Interviewers learn more when you say, “We had two options; I chose the simpler feature-flagged path because the deadline was close and the blast radius was high, then planned a follow-up refactor.”
Connections
This topic often pivots into system design, especially observability, staged rollout, reliability, and scalability tradeoffs. It can also connect to debugging incidents, cross-functional collaboration, and basic experimentation hygiene when the interviewer asks how you knew your change actually caused the metric movement.
Further reading
-
Google SRE Book — Practical vocabulary for
SLO, error budgets, incident response, and reliability ownership. -
Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim — Strong framing for engineering delivery metrics, deployment frequency, lead time, change failure rate, and recovery time.
-
The STAR Method — Useful baseline structure for behavioral answers, though strong engineering answers should add metrics and tradeoffs.
Practice questions
Onsite
Coding & Algorithms

What's being tested
Graph/grid connectivity questions test whether you can model cells or entities as nodes, traverse components correctly, and preserve invariants under edge cases like cycles, wrap-around, or repeated shapes. Interviewers look for clean DFS/BFS, Union-Find, topological sort, and proof-level reasoning about correctness and complexity.
Patterns & templates
-
Grid
DFS/BFScomponents — scan every cell, start traversal on unvisited land;O(R*C)time,O(R*C)worst-case space. -
Distinct island normalization — record relative coordinates from an origin, e.g.
(r-r0, c-c0); sort/serialize shape to compare patterns. -
Bipartite graph coloring — assign colors with
BFS/DFS; conflict on same-color edge means impossible; handle disconnected components. -
Topological sort — use
Kahn’s algorithmwith indegrees or postorderDFS; cycle exists if processed count< n. -
Dynamic connectivity — use Disjoint Set Union with path compression and union by rank; near-constant amortized
α(n)operations. -
Torus grid adjacency — compute neighbors with modulo:
(r+dr+R)%R,(c+dc+C)%C; beware duplicate neighbors in tiny grids. -
Monotonic deque — for sliding maximum, keep indices in decreasing value order;
O(n)time, remove expired front indices.
Common pitfalls
Pitfall: Treating diagonal cells as connected when the problem only allows 4-directional adjacency.
Pitfall: Counting distinct islands by traversal string without stable direction markers or relative coordinates, causing different shapes to collide.
Pitfall: Running topological sort from only one node; disconnected graph components must all be initialized or visited.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions
System Design

What's being tested
Interviewers are probing whether you can turn an ambiguous, large-scale product requirement into a scalable distributed architecture with clear APIs, data models, traffic estimates, failure handling, and operational tradeoffs. For TikTok-scale systems, a Software Engineer must reason about latency, availability, consistency, capacity, and cost under global traffic, bursty fanout, and partial failures. Strong answers show you can decompose a system into services, choose storage and messaging primitives intentionally, and explain how the design behaves at p99, not just in the happy path. The interviewer is also checking whether you can communicate architecture clearly: requirements first, then data flow, then bottlenecks, then mitigations.
Core knowledge
-
Requirements framing should separate functional requirements from non-functional requirements. For example: “send push/email/SMS notifications” is functional;
p99 < 500msenqueue latency,99.99%availability, regional data residency, deduplication, and retry semantics are non-functional constraints that shape the architecture. -
Back-of-the-envelope capacity planning grounds every design. Estimate requests per second with and storage with Use peak factors like
5x–20xfor notifications, ticket launches, and viral traffic. -
Service decomposition should follow ownership and scaling boundaries. Typical components include an API gateway, stateless application services, durable storage, cache, message queue, worker fleet, rate limiter, and observability pipeline. Avoid over-splitting early; introduce separate services when scaling, isolation, or deployment cadence requires it.
-
Stateless services behind load balancers scale horizontally and simplify failover. Keep request state in
Redis,MySQL,Postgres,DynamoDB,Cassandra, or object storage rather than local memory. Statelessness enables autoscaling but shifts correctness to storage, cache invalidation, and idempotency. -
Data modeling depends on access patterns.
PostgresorMySQLworks for relational constraints and transactions;DynamoDBorCassandrafits high-write, key-value workloads;Elasticsearchfits search;Redisfits ephemeral counters, locks, and hot reads. Always state primary keys, secondary indexes, and the most common query paths. -
Consistency tradeoffs are central. Strong consistency is needed for inventory, payments, and ticket ownership; eventual consistency is acceptable for feeds, notifications, counters, and analytics-like status updates. Name the failure mode: stale reads, duplicate sends, lost updates, split-brain leadership, or conflicting writes.
-
Idempotency prevents duplicate side effects during retries. Use an
idempotency_key, request hash, or natural unique key such as(user_id, campaign_id, channel)and store a processed record with TTL. This is critical for notification sends, order creation, reservation confirmation, and external provider callbacks. -
Message queues such as
Kafka,Pulsar,RabbitMQ, or cloud queues decouple producers from consumers. They absorb spikes, support retries, and allow worker autoscaling. Discuss delivery semantics: at-most-once may lose messages, at-least-once may duplicate, exactly-once is expensive and often approximated with idempotent consumers. -
Partitioning and sharding control scale and hotspots. Hash by
user_idfor even distribution, byevent_idfor immutable events, or by region for data residency. Beware celebrity users, flash sales, and global announcements; these create hot keys that require fanout batching, token buckets, or precomputed shards. -
Caching improves latency but adds invalidation complexity. Use
CDNfor static assets,Redisfor hot objects and counters, and local in-process caches for small reference data. Define TTLs, cache-aside versus write-through behavior, and how the system behaves on cache miss or cache outage. -
Reliability patterns include timeouts, retries with exponential backoff and jitter, circuit breakers, bulkheads, dead-letter queues, and graceful degradation. Retries must be bounded; otherwise they amplify outages. A good default is short client timeouts, retry only idempotent operations, and send poison messages to a
DLQ. -
Observability must include metrics, logs, and traces. Track
QPS,p50/p95/p99latency, error rate, queue lag, retry count, consumer throughput, saturation, dropped messages, duplicate rate, and provider-specific failures. Tie alerts to user impact and SLOs, not just CPU thresholds.
Worked example
For Design a global notification service, start by clarifying scope in the first 30 seconds: “Are we supporting push only, or also email and SMS? Is the requirement low-latency transactional notifications, bulk marketing campaigns, or both? Do we need regional data residency and user-level opt-out preferences?” Then declare assumptions: multi-tenant service, billions of notifications per day, at-least-once delivery internally, no duplicate visible sends when possible, and regional active-active deployment.
Organize the answer around four pillars: API surface, data model, delivery pipeline, and reliability/operations. The API might include POST /notifications, POST /campaigns, GET /status/{id}, and preference-management endpoints. The data model should include notification templates, user channel tokens, preferences, tenant quotas, send attempts, and deduplication records keyed by idempotency_key. The delivery path should be: API validates and persists request, publishes to Kafka, channel-specific workers consume, apply rate limits and preferences, call external providers such as APNs or FCM, and record delivery status asynchronously.
A key tradeoff to flag is fanout timing: fanout-on-write sends all user notifications immediately and gives low latency, but creates huge bursts; fanout-on-read or staged fanout smooths load but delays delivery and complicates status tracking. For global delivery, explain regional routing: keep user preference data in-region when required, use local queues and workers, and fail over only traffic that is legally and operationally safe to move. Close by saying: “If I had more time, I’d go deeper on provider-specific throttling, tenant isolation, disaster recovery drills, and exactly how SLOs map to alert thresholds.”
A second angle
For Design a high-volume ticketing system, the same architecture vocabulary applies, but the dominant constraint changes from asynchronous fanout to strong consistency under contention. A notification service can usually tolerate duplicate internal processing if the final send is deduped; a ticketing system cannot oversell inventory. The design should emphasize reservation records, short TTL holds, transactional decrement or compare-and-swap, queue-based admission control, and anti-hotspot strategies for popular events. Redis can help with waiting rooms or rate limiting, but the source of truth for ticket ownership needs a durable store with transactional guarantees, such as Postgres, MySQL, or a carefully designed distributed database. The strongest candidates explicitly contrast “smooth bursts with queues” versus “serialize scarce inventory updates.”
Common pitfalls
Pitfall: Jumping straight into boxes and arrows without requirements.
A tempting answer is “I’ll use Kafka, Redis, and microservices” before defining traffic, latency, consistency, or failure expectations. A stronger answer starts with scope, assumptions, core APIs, and capacity estimates, then chooses technologies because they satisfy those constraints.
Pitfall: Treating all data as equally consistent.
Many candidates say “eventual consistency is fine” for everything, which fails for ticket purchases, quota enforcement, and user preference compliance. Call out which operations need strong consistency, which can be eventually consistent, and what user-visible anomaly might occur if the system returns stale data.
Pitfall: Ignoring operational behavior after the initial design.
A design that works only when dependencies are healthy is incomplete. Discuss what happens when Redis is down, a Kafka partition lags, an external push provider throttles, or a region fails; include retries, dead-letter queues, backpressure, dashboards, and manual replay paths.
Connections
Interviewers may pivot from distributed architecture into API design, database indexing and sharding, message queue semantics, rate limiting, or incident debugging. They may also ask you to deep-dive a project you built, so prepare one real architecture where you can explain the original constraints, the bottleneck you hit, the tradeoff you chose, and what you would redesign now.
Further reading
-
Designing Data-Intensive Applications — the best single reference for replication, partitioning, transactions, stream processing, and consistency tradeoffs.
-
Dynamo: Amazon’s Highly Available Key-value Store — foundational paper for quorum reads/writes, consistent hashing, and availability-oriented storage.
-
Google SRE Book — practical treatment of SLOs, error budgets, monitoring, and operating reliable distributed systems.
Practice questions
Take-home Project
Coding & Algorithms

What's being tested
Binary tree traversal under changing constraints: visibility, deletion/replacement, width indexing, reconstruction, and distance relationships. Interviewers are probing whether you can choose DFS, BFS, recursion, hashing, or parent pointers deliberately, while handling null roots, skewed trees, duplicate values, and overflow.
Patterns & templates
-
Right-side view BFS — level-order traversal, append last node per level;
O(n)time,O(w)space wherewis max width. -
Right-side view DFS — visit
root-right-left, record first node at each depth; cleaner recursion, but watch stack depth on skewed trees. -
Tree deletion / replacement — for BST-style deletion, handle 0/1/2 children; use inorder successor/predecessor; return updated subtree root.
-
Maximum width indexing — BFS with complete-tree positions: left
2*i, right2*i+1; normalize each level to avoid integer overflow. -
Build tree from traversals —
preorder + inorderorinorder + postorder; hashmap value-to-index givesO(n)time, recursive bounds prevent slicing overhead. -
Distance-k pairs / nodes — convert tree to undirected graph via parent map, then BFS; or use postorder distances for pair counting.
-
Validation mindset — confirm unique node values for reconstruction; malformed traversal arrays should fail fast or be documented as unsupported.
Common pitfalls
Pitfall: Treating width as node count per level; width includes missing positions between the leftmost and rightmost non-null nodes.
Pitfall: Reconstructing from traversal arrays with duplicates without extra identity information; the tree is not uniquely determined.
Pitfall: Forgetting to return the new root after deletion, especially when deleting the original root node.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions

What's being tested
Array search, selection, and dynamic programming problems test whether you can turn brute-force scans into structured O(log n), O(n), or O(n log n) solutions. Interviewers look for clean boundary handling, invariant-based reasoning, and the ability to explain time/space tradeoffs under production-like constraints.
Patterns & templates
-
Binary search boundaries — implement
lower_bound(nums, target)andupper_bound(nums, target);O(log n)time, avoid off-by-one errors. -
Quickselect partitioning — find kth largest via in-place
partition; averageO(n), worstO(n^2), randomized pivot reduces risk. -
Heap selection — maintain a min-heap of size
k;O(n log k)time, safer than Quickselect when worst-case predictability matters. -
Dynamic programming for subsequences — LIS uses
dp[i] = max(dp[j] + 1)forj < i;O(n^2)baseline, simple and explainable. -
Patience sorting LIS — maintain
tailsand binary-search replacement index;O(n log n)time, buttailsis not the actual subsequence. -
Prefix/suffix parity sums — deletion-fairness problems need even/odd sums before and after index removal;
O(n)time,O(1)possible. -
Interval sweep / min-heap rooms — sort start times or intervals; count overlapping meetings in
O(n log n), clarify inclusive vs exclusive endpoints.
Common pitfalls
Pitfall: Returning any target index instead of the first/last boundary misses the core binary search invariant.
Pitfall: Treating kth largest as index
kafter sorting; it is usuallynums[len(nums) - k]in ascending order.
Pitfall: For LIS, using
<=instead of<changes “increasing” into “non-decreasing.”
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions