Anthropic Software Engineer Interview Prep Guide
Everything Anthropic actually asks Software Engineer candidates — concept walkthroughs, worked examples, and the real interview questions, drawn from candidate reports. Free to read.

Technical Screen
Coding & Algorithms
- File Deduplication And Content Hashing — covered in depth under Onsite below.
- LRU Cache Design And Persistence — covered in depth under Onsite below.
- Thread-Safe Queues And Concurrency Primitives — covered in depth under Onsite below.
- Stack Trace And Profiler Log Processing — covered in depth under Onsite below.
System Design
- Web Crawlers, URL Normalization, And Politeness — covered in depth under Onsite below.
ML System Design
- ML Inference APIs And GPU Batching — covered in depth under Onsite below.
Machine Learning
- ML Fundamentals: Backprop, Attention, And RL — covered in depth under Onsite below.
Behavioral & Leadership
- AI Safety, Mission Alignment, And Leadership Judgment — covered in depth under Onsite below.
Onsite
Coding & Algorithms

File Deduplication And Content Hashing
What's being tested
This tests content-based duplicate detection under real filesystem constraints: recursive traversal, streaming I/O, hashing, collision handling, and memory-aware grouping. Strong answers show a staged algorithm that avoids reading every byte unnecessarily while still proving duplicates by content.
Patterns & templates
- Recursive filesystem traversal with os.walk, scandir, or explicit stack — O(files + dirs) metadata pass; handle permissions, symlinks, and cycles.
- Size-first bucketing — group by file size before hashing; files with unique sizes cannot be duplicates, reducing I/O dramatically.
- Partial hash then full hash — hash first/last chunks before full content; improves average case while preserving final exact verification.
- Streaming hash computation using sha256.update(chunk) — O(total_bytes) time, O(chunk_size) memory; never load large files fully.
- Collision-safe comparison — hash groups identify candidates, then byte-compare files or use cryptographic hashes plus optional verification.
- Chunk-based deduplication for large files — fixed-size or content-defined chunking with rolling hashes; useful when files share regions but differ globally.
- Parallel I/O pipeline — worker pool for hashing candidate buckets; bound concurrency to avoid disk thrashing and excessive open file descriptors.
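The staged pruning described in the list above can be sketched as follows. This is a minimal single-threaded illustration, not a reference solution: the names `find_duplicates`, `regroup`, `file_digest`, and `same_bytes`, the 1 MiB chunk size, and the 4096-byte partial-hash prefix are all assumptions chosen for the example.

```python
import hashlib
import os
from collections import defaultdict

CHUNK = 1 << 20  # 1 MiB read size; an arbitrary choice for this sketch


def file_digest(path, limit=None):
    """Stream a file through SHA-256; hash only the first `limit` bytes if given."""
    h = hashlib.sha256()
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            size = CHUNK if remaining is None else min(CHUNK, remaining)
            chunk = f.read(size)
            if not chunk:
                break
            h.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return h.hexdigest()


def same_bytes(a, b):
    """Final byte-for-byte check so a hash collision cannot produce a false match."""
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            ca, cb = fa.read(CHUNK), fb.read(CHUNK)
            if ca != cb:
                return False
            if not ca:
                return True


def regroup(groups, key):
    """Split each candidate group by `key`, keeping only groups of 2+ files."""
    out = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        out.extend(g for g in buckets.values() if len(g) > 1)
    return out


def find_duplicates(root):
    paths = []
    # os.walk does not descend into symlinked directories by default.
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            p = os.path.join(dirpath, name)
            if os.path.isfile(p) and not os.path.islink(p):
                paths.append(p)
    groups = regroup([paths], os.path.getsize)                 # stage 1: size
    groups = regroup(groups, lambda p: file_digest(p, 4096))   # stage 2: partial hash
    groups = regroup(groups, file_digest)                      # stage 3: full hash
    verified = []
    for group in groups:                                       # stage 4: byte compare
        clusters = []
        for p in group:
            for c in clusters:
                if same_bytes(c[0], p):
                    c.append(p)
                    break
            else:
                clusters.append([p])
        verified.extend(sorted(c) for c in clusters if len(c) > 1)
    return verified
```

Each stage only reads files that survived the previous one, so a file with a unique size is never opened. In an interview it is worth saying out loud which stages you would parallelize and how you would bound open file descriptors.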
Common pitfalls
Pitfall: Hashing every file immediately ignores the easy size -> candidates -> hash -> verify pruning pipeline and wastes I/O.
Pitfall: Treating hashes as proof of equality without discussing collisions is incomplete; mention cryptographic hashes and final byte comparison.
Pitfall: Following symlinks blindly can create cycles or duplicate paths to the same inode; track (device, inode) when needed.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions

LRU Cache Design And Persistence
What's being tested
This tests LRU cache implementation with O(1) lookup, update, and eviction using a hash map plus doubly linked list. Harder variants add memoization key canonicalization, variable *args/**kwargs, and persistence/crash recovery without losing ordering correctness.
Patterns & templates
- Hash map + doubly linked list — map keys to nodes; list order stores recency; get/put move nodes to front in O(1).
- Sentinel head/tail nodes simplify remove(node) and insert_front(node); avoid special cases for empty, one-item, and tail eviction.
- Capacity eviction happens after insert/update; if size > capacity, remove tail.prev and delete its key from the map.
- Decorator memoization wraps func(*args, **kwargs); key should include function identity plus canonicalized arguments, not just raw positional tuple.
- Canonical argument binding with inspect.signature(func).bind() followed by apply_defaults() normalizes defaults and keyword order; convert unhashable structures recursively before hashing.
- Persistence snapshot serializes capacity, key-value pairs, and recency order using pickle, json, or custom encoding; restore list order exactly.
- Crash resilience needs atomic writes: write to temp file, flush/fsync, then os.replace; optionally use an append-only log plus compaction.
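A minimal sketch of the hash map plus doubly linked list pattern with sentinel nodes, as described in the list above. The class and method names (`LRUCache`, `_Node`, `keys_by_recency`) are illustrative choices, and the sketch is not thread-safe.

```python
class _Node:
    __slots__ = ("key", "value", "prev", "next")

    def __init__(self, key=None, value=None):
        self.key, self.value = key, value
        self.prev = self.next = None


class LRUCache:
    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.map = {}                              # key -> node
        self.head, self.tail = _Node(), _Node()    # sentinels: head side is most recent
        self.head.next, self.tail.prev = self.tail, self.head

    def _remove(self, node):
        node.prev.next, node.next.prev = node.next, node.prev

    def _insert_front(self, node):
        node.prev, node.next = self.head, self.head.next
        self.head.next.prev = node
        self.head.next = node

    def get(self, key, default=None):
        node = self.map.get(key)
        if node is None:
            return default
        self._remove(node)          # a hit counts as use
        self._insert_front(node)
        return node.value

    def put(self, key, value):
        node = self.map.get(key)
        if node is not None:
            node.value = value      # an overwrite also counts as use
            self._remove(node)
        else:
            node = _Node(key, value)
            self.map[key] = node
        self._insert_front(node)
        if len(self.map) > self.capacity:
            lru = self.tail.prev    # least recently used sits just before the tail
            self._remove(lru)
            del self.map[lru.key]

    def keys_by_recency(self):
        """Keys from most to least recently used; handy for tests and snapshots."""
        out, node = [], self.head.next
        while node is not self.tail:
            out.append(node.key)
            node = node.next
        return out
```

Because of the sentinels, `_remove` and `_insert_front` never branch on empty or single-item lists, which is the main source of bugs in hand-written versions.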
Common pitfalls
Pitfall: Updating a value without moving it to most-recent breaks the LRU contract; both cache hits and overwrites count as use.
Pitfall: Using str(args) + str(kwargs) for keys is nondeterministic or ambiguous; keyword order and mutable containers must be canonicalized.
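One way to build a canonical memoization key is sketched below. The helpers `freeze` and `make_key` are hypothetical names, and using module plus qualified name as the function identity is an assumption for this example; other identity schemes are equally defensible.

```python
import inspect


def freeze(value):
    """Recursively convert common unhashable containers into hashable, tagged tuples."""
    if isinstance(value, dict):
        items = ((freeze(k), freeze(v)) for k, v in value.items())
        return ("dict", tuple(sorted(items, key=repr)))
    if isinstance(value, (set, frozenset)):
        return ("set", tuple(sorted((freeze(v) for v in value), key=repr)))
    if isinstance(value, list):
        return ("list", tuple(freeze(v) for v in value))
    if isinstance(value, tuple):
        return ("tuple", tuple(freeze(v) for v in value))
    return value  # assumed to be hashable already


def make_key(func, args, kwargs):
    """Bind arguments to parameter names so call style does not change the key."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()  # fill in defaults the caller omitted
    return (func.__module__, func.__qualname__, freeze(tuple(bound.arguments.items())))
```

With this key, `f(1)`, `f(a=1)`, and `f(a=1, b=2)` map to the same entry when `b` defaults to 2, and keyword order no longer matters.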
Pitfall: Persisting only the dictionary is insufficient; recovery also needs recency order, capacity, and enough metadata to reject corrupted snapshots.
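A minimal sketch of a crash-safe snapshot that keeps recency order and can reject corruption. The file layout (a JSON record wrapping a payload and a CRC32) and the function names are assumptions for illustration; keys and values are assumed to be JSON-serializable.

```python
import json
import os
import tempfile
import zlib


def save_snapshot(path, capacity, items):
    """items: list of [key, value] pairs ordered most-recent first."""
    payload = json.dumps({"capacity": capacity, "items": items}, sort_keys=True)
    record = json.dumps({"crc32": zlib.crc32(payload.encode()), "payload": payload})
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".snapshot-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(record)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # readers see either the old or the new snapshot
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def load_snapshot(path):
    with open(path) as f:
        record = json.load(f)
    payload = record["payload"]
    if zlib.crc32(payload.encode()) != record["crc32"]:
        raise ValueError("snapshot checksum mismatch")
    data = json.loads(payload)
    return data["capacity"], data["items"]
```

On restore, inserting the items from least to most recent rebuilds the original list order. For stricter durability you could also fsync the containing directory after the rename.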
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions

Thread-Safe Queues And Concurrency Primitives
What's being tested
This tests concurrent data structure design: implementing FIFO queues/buffers that remain correct under multiple producers and consumers. Interviewers probe whether you can use mutexes, condition variables, semaphores, and shutdown semantics without races, deadlocks, busy-waiting, or lost wakeups.
Patterns & templates
- Bounded blocking queue — store items in collections.deque; put() waits while full, get() waits while empty; both are O(1).
- Condition-variable loop — always call wait() inside while not predicate; handles spurious wakeups, missed notifications, and predicate changes after reacquiring lock.
- Producer–consumer template — one lock protects queue state; not_empty.notify() after enqueue, not_full.notify() after dequeue; avoid holding lock during expensive work.
- Timed waits — compute absolute deadline with time.monotonic(); loop with remaining timeout; return False, None, or raise TimeoutError consistently.
- Shutdown protocol — maintain closed flag under the same lock; wake all waiters with notify_all(); define whether pending items drain or abort.
- CPU vs I/O concurrency — Python threads help I/O-bound work despite the GIL; CPU-bound image processing usually needs multiprocessing or native extensions.
- Thread-pool pipeline — use queue.Queue, worker sentinels, join(), and exception collection; bound queue size to apply backpressure and cap memory.
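The bounded queue, condition-variable loop, timed wait, and shutdown patterns above can be combined in one small class. This is a sketch under stated assumptions: the names `BoundedQueue` and `Closed` are hypothetical, and the shutdown policy chosen here is "drain" (consumers may still take pending items after `close()`, producers are rejected).

```python
import threading
import time
from collections import deque


class Closed(Exception):
    """Raised when the queue is closed and no further progress is possible."""


class BoundedQueue:
    def __init__(self, maxsize):
        self._items = deque()
        self._maxsize = maxsize
        self._closed = False
        lock = threading.Lock()  # one lock protects all queue state
        self._not_empty = threading.Condition(lock)
        self._not_full = threading.Condition(lock)

    @staticmethod
    def _remaining(deadline):
        if deadline is None:
            return None
        left = deadline - time.monotonic()
        if left <= 0:
            raise TimeoutError("queue operation timed out")
        return left

    def put(self, item, timeout=None):
        deadline = None if timeout is None else time.monotonic() + timeout
        with self._not_full:
            while len(self._items) >= self._maxsize and not self._closed:
                self._not_full.wait(self._remaining(deadline))
            if self._closed:
                raise Closed("put on closed queue")
            self._items.append(item)
            self._not_empty.notify()

    def get(self, timeout=None):
        deadline = None if timeout is None else time.monotonic() + timeout
        with self._not_empty:
            while not self._items and not self._closed:
                self._not_empty.wait(self._remaining(deadline))
            if not self._items:  # closed and fully drained
                raise Closed("queue closed and empty")
            item = self._items.popleft()
            self._not_full.notify()
            return item

    def close(self):
        with self._not_empty:
            self._closed = True
            self._not_empty.notify_all()  # wake every blocked consumer
            self._not_full.notify_all()   # wake every blocked producer
```

Both waits sit inside `while` loops that re-check the predicate after the lock is reacquired, and the deadline is computed once so repeated wakeups do not extend the timeout.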
Common pitfalls
Pitfall: Using if queue_empty: wait() instead of while queue_empty: wait() can break under spurious wakeups or competing consumers.
Pitfall: Calling callbacks, image transforms, network I/O, or disk writes while holding the queue lock serializes the system and risks deadlock.
Pitfall: Forgetting shutdown behavior leaves blocked producers or consumers hanging forever; explicitly wake waiters and document drain-vs-cancel semantics.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions

Stack Trace And Profiler Log Processing
What's being tested
This tests stack simulation over ordered execution logs: reconstruct active calls, compute inclusive and exclusive time, and emit derived trace events. Interviewers are probing careful interval reasoning, edge-case handling, and clean O(n) stream-processing code.
Patterns & templates
- Call-stack simulation — maintain stack[(fn, start_ts, child_time)]; on end, exclusive time is end - start - child_time.
- Previous-timestamp accounting — for LeetCode-style start/end logs, charge ts - prev_ts to stack[-1]; update prev_ts after each event.
- Stack sample diffing — compare old and new frame arrays by longest common prefix; emit end events deepest-first, start events shallowest-first.
- Call tree reconstruction — parse indentation, frame depth, or parent IDs; attach nodes via a stack of current ancestors in O(n) time.
- Streaming validation — detect missing end, impossible end order, negative durations, equal timestamp ambiguity, and recursion by tracking frame identity.
- Complexity target — aim for O(n) time and O(d) space, where d is max stack depth; aggregate per-function maps separately.
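A minimal sketch of the call-stack simulation with basic validation, assuming events arrive as `(fn, kind, ts)` tuples ordered by timestamp and intervals are half-open. The function name `profile` and the event shape are assumptions for this example.

```python
from collections import defaultdict


def profile(events):
    """Return (inclusive, exclusive) total time per function name.

    events: iterable of (fn, kind, ts) with kind in {"start", "end"}.
    """
    inclusive, exclusive = defaultdict(int), defaultdict(int)
    stack = []  # each frame is [fn, start_ts, child_time]
    for fn, kind, ts in events:
        if kind == "start":
            stack.append([fn, ts, 0])
        elif kind == "end":
            if not stack or stack[-1][0] != fn:
                raise ValueError(f"unmatched end for {fn!r} at {ts}")
            _, start, child_time = stack.pop()
            total = ts - start
            if total < 0:
                raise ValueError(f"negative duration for {fn!r}")
            inclusive[fn] += total
            exclusive[fn] += total - child_time
            if stack:
                stack[-1][2] += total  # the parent must not be charged for this span
        else:
            raise ValueError(f"unknown event kind {kind!r}")
    if stack:
        raise ValueError(f"missing end for {stack[-1][0]!r}")
    return dict(inclusive), dict(exclusive)
```

Each call gets its own frame, so recursion works even when names repeat. Note that summing inclusive time per name counts nested recursive calls more than once; say so explicitly if the interviewer asks for inclusive totals.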
Common pitfalls
Pitfall: Confusing inclusive time with exclusive time; parent time must subtract completed child intervals or only receive gaps while active.
Pitfall: Handling equal timestamps inconsistently; declare whether intervals are half-open [start, end) or inclusive, then apply it everywhere.
Pitfall: Treating function name as unique identity during recursion; recursive calls need separate stack frames even when fn is identical.
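The stack sample diffing pattern can be sketched the same way. Here frames are compared by position in the stack rather than by name alone, which keeps recursive frames distinct. The input shape, a list of `(ts, frames)` samples with frames ordered outermost first, is an assumption for this example.

```python
def diff_samples(prev, curr, ts):
    """Emit end events deepest-first, then start events shallowest-first."""
    common = 0
    while common < len(prev) and common < len(curr) and prev[common] == curr[common]:
        common += 1
    events = [("end", fn, ts) for fn in reversed(prev[common:])]
    events += [("start", fn, ts) for fn in curr[common:]]
    return events


def samples_to_events(samples):
    """Convert periodic stack samples into start/end trace events."""
    events, prev, last_ts = [], [], None
    for ts, frames in samples:
        events.extend(diff_samples(prev, frames, ts))
        prev, last_ts = frames, ts
    if prev:  # close whatever is still open at the final sample
        events.extend(diff_samples(prev, [], last_ts))
    return events
```

Sampling cannot see calls that start and finish between two samples, so the derived events are a lower bound on real activity.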
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions
System Design

Web Crawlers, URL Normalization, And Politeness
What's being tested
These interviews probe whether you can design a concurrent, polite, fault-tolerant crawler rather than just “spawn threads and fetch URLs.” The interviewer is looking for practical distributed-systems judgment: URL frontier design, deduplication, synchronization, per-host rate limiting, failure recovery, and observability under network uncertainty. Anthropic cares about this because crawlers exercise the same engineering muscles as many production systems: high fan-out I/O, adversarial inputs, backpressure, fairness, idempotency, and careful resource usage. A strong answer makes tradeoffs explicit: single-machine versus distributed, exact versus approximate deduplication, throughput versus politeness, and freshness versus crawl coverage.
Core knowledge
- URL frontier is the central scheduling abstraction: it stores discovered-but-unfetched URLs and decides what to fetch next. A robust design usually separates a global priority queue from per-host queues so you can enforce fairness and host-specific politeness without starving the crawl.
- Per-host politeness means never hammering one domain even if many workers are idle. Track nextFetchTime[host]; a URL is eligible only when now >= nextFetchTime[host]. If crawl delay is $d$ seconds and last fetch was at time $t_{\text{last}}$, schedule the next fetch at $t_{\text{last}} + d$.
- Concurrency model depends on scale. On one machine, a thread pool or async I/O loop with bounded queues is enough; network fetches are I/O-bound, so asyncio, epoll, or nonblocking sockets can outperform thousands of threads. Distributed crawlers need partitioning by host or URL hash.
- Host partitioning is safer than pure URL hashing when enforcing politeness. If all URLs for example.com route to the same shard, rate limiting is local and simple. If URLs are randomly distributed, you need distributed locks or shared rate-limit state, which adds latency and failure modes.
- URL normalization prevents duplicate work. Normalize scheme and host case, remove default ports, resolve . and .., strip fragments, sort query parameters when order is semantically irrelevant, percent-decode safe characters, and canonicalize trailing slashes carefully. Be conservative: over-normalization can merge distinct pages.
- URL deduplication usually combines exact and approximate techniques. Exact sets using URL hashes work up to memory limits; a 64-bit hash for 1B URLs needs roughly 8 GB just for hashes before overhead. Bloom filters reduce memory with false positives: $p \approx \left(1 - e^{-kn/m}\right)^{k}$, where $m$ is bits, $n$ entries, and $k$ hash functions.
- Content deduplication catches different URLs serving the same page. Use cryptographic hashes like SHA-256 for exact duplicates, and SimHash or MinHash for near-duplicates. URL dedup saves fetches; content dedup saves storage and downstream processing.
- Robots and crawl directives are part of politeness. Fetch and cache robots.txt per host, honor Disallow, Allow, and Crawl-delay where applicable, and choose a clear user agent. Cache negative results too, but set TTLs because policies can change.
- Backpressure is mandatory. Bound the frontier, fetch queue, parser queue, and storage writes; if storage slows down, workers should stop pulling more URLs rather than accumulating unbounded memory. Track queue depth, worker utilization, and fetch latency percentiles such as p95 and p99.
- Fault tolerance requires idempotent state transitions. A URL can move through states like DISCOVERED, SCHEDULED, FETCHING, FETCHED, FAILED, and RETRYABLE. If a worker dies mid-fetch, a lease or visibility timeout should return the URL to the frontier without duplicating completed work.
- Retry policy should distinguish transient and permanent failures. Retry 429, 500, 502, 503, and timeouts with exponential backoff and jitter; do not blindly retry 404 or 410. For 429, respect Retry-After and lower the host’s effective rate.
- Observability should expose both crawler health and ethical behavior. Useful metrics include fetches/sec, bytes/sec, unique URLs discovered, duplicate rate, error rate by status code, per-host request rate, robots-denied count, frontier size, retry count, and storage write latency.
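To make the deduplication memory tradeoff in the list above concrete, the standard Bloom filter formulas can be evaluated directly. The function names are illustrative, and the 1% target false-positive rate is an assumed example value, not a recommendation.

```python
import math


def bloom_false_positive_rate(m_bits, n_entries, k_hashes):
    """Approximate false-positive probability: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k_hashes * n_entries / m_bits)) ** k_hashes


def bloom_parameters(n_entries, target_fpr):
    """Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes."""
    m = math.ceil(-n_entries * math.log(target_fpr) / (math.log(2) ** 2))
    k = max(1, round(m / n_entries * math.log(2)))
    return m, k


if __name__ == "__main__":
    n = 1_000_000_000
    m, k = bloom_parameters(n, 0.01)
    print(f"bits={m}, bytes~{m / 8 / 1e9:.2f} GB, hashes={k}")
    print(f"expected fpr~{bloom_false_positive_rate(m, n, k):.4f}")
```

Under these assumptions, 1B URLs at a 1% false-positive rate need about 1.2 GB and 7 hash functions, compared with roughly 8 GB for exact 64-bit hashes. The cost is that a false positive silently skips a URL that was never fetched.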
Worked example
For Design a concurrent web crawler, start by clarifying scope: “Is this single-machine or distributed? Should we crawl the whole web or a fixed set of domains? Do we need freshness recrawls, or just one-time discovery? What politeness constraints must we honor?” Then declare assumptions, for example: single region, distributed crawler, billions of URLs, HTML pages only, strict per-host rate limits, and persistent crawl state.
A strong answer can be organized into four pillars: frontier and scheduling, fetch/parse pipeline, deduplication and storage, and fault tolerance/observability. For frontier design, propose per-host queues plus a global min-heap ordered by each host’s next eligible fetch time. Workers pop an eligible host, fetch one URL, parse links, normalize and dedup them, enqueue new URLs, and update that host’s next fetch time.
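A single-process sketch of that frontier: per-host queues plus a min-heap keyed by each host's next eligible fetch time. The class name `HostFrontier`, the in-memory exact `seen` set, and passing `now` explicitly (to keep the logic testable) are assumptions for this illustration; it also treats the moment a URL is handed out as the fetch time.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


class HostFrontier:
    def __init__(self, crawl_delay=1.0):
        self.crawl_delay = crawl_delay
        self.queues = defaultdict(deque)  # host -> pending URLs
        self.ready = []                   # min-heap of (next_fetch_time, host)
        self.scheduled = set()            # hosts that currently have a heap entry
        self.next_fetch_time = {}         # host -> earliest allowed fetch time
        self.seen = set()                 # exact URL dedup

    def add(self, url, now):
        """Enqueue a URL; returns False if it was already seen."""
        if url in self.seen:
            return False
        self.seen.add(url)
        host = urlsplit(url).hostname or ""
        self.queues[host].append(url)
        if host not in self.scheduled:
            eligible_at = max(now, self.next_fetch_time.get(host, now))
            heapq.heappush(self.ready, (eligible_at, host))
            self.scheduled.add(host)
        return True

    def pop(self, now):
        """Return one eligible URL, or None if every host must still wait."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url = self.queues[host].popleft()
        self.next_fetch_time[host] = now + self.crawl_delay
        if self.queues[host]:
            heapq.heappush(self.ready, (self.next_fetch_time[host], host))
        else:
            self.scheduled.discard(host)
            del self.queues[host]
        return url
```

Keeping exactly one heap entry per host means idle workers can never pull a second URL for a host that is still inside its delay window, no matter how many URLs that host has queued.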
For deduplication, use an exact URL-hash store if scale permits, or a Bloom filter in front of durable storage to reduce lookups. For persistence, store URL metadata, fetch status, content hash, timestamps, HTTP status, and retry count in a durable database or log-backed store. One explicit tradeoff to flag: partitioning by host simplifies politeness and reduces coordination, but may create hot shards for huge domains like wikipedia.org; you can mitigate with intra-host subqueues while preserving a single rate limiter.
Close by saying that, with more time, you would discuss recrawl scheduling, canonical URL signals, distributed rebalancing, and anti-abuse safeguards such as robots compliance and adaptive throttling after 429 responses.
A second angle
For Implement crawler, dedup, and persistent LRU, the same concepts become more implementation-focused and less about large-scale distributed architecture. The interviewer may expect you to write or sketch code for graph traversal, URL normalization, a visited set, and a durable cache with LRU eviction. Instead of discussing host-sharded frontier services, focus on clean data structures: a queue for BFS, a hash set for visited URLs, a doubly linked list plus hash map for O(1) LRU operations, and serialization for restart. The same tradeoff appears at smaller scale: exact deduplication is simple and correct, while persistent approximate deduplication saves memory but can skip valid URLs. This variant rewards code clarity, edge-case handling, and deterministic behavior under crashes.
Common pitfalls
Pitfall: Designing only a thread pool and a queue.
This is the most common incomplete answer: “Use 100 workers, each pops a URL and fetches it.” That ignores per-host politeness, dedup races, retries, and bounded memory. A better answer starts with the frontier as the control plane and treats workers as stateless executors.
Pitfall: Over-normalizing URLs.
It is tempting to sort all query parameters, drop all tracking-looking parameters, or force trailing-slash conventions globally. Those rules can merge distinct resources, especially for sites where query order or parameters are meaningful. Say you would apply conservative normalization first, then optionally learn site-specific canonicalization from redirects, <link rel="canonical">, and observed duplicates.
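A conservative normalizer might look like the sketch below. It lowercases scheme and host, drops default ports, resolves dot segments, and strips fragments, while leaving query order alone unless the caller opts in. The names `normalize_url` and `remove_dot_segments` and the `sort_query` flag are assumptions for this example; percent-encoding normalization is deliberately left out.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def remove_dot_segments(path):
    """Resolve '.' and '..' in an absolute path, keeping a trailing slash if implied."""
    segments = path.split("/")
    out = []
    for i, seg in enumerate(segments):
        last = i == len(segments) - 1
        if seg == ".":
            if last:
                out.append("")
        elif seg == "..":
            if len(out) > 1:
                out.pop()
            if last:
                out.append("")
        else:
            out.append(seg)
    return "/".join(out)


def normalize_url(url, sort_query=False):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if ":" in host:  # IPv6 literal: restore the brackets urlsplit removed
        host = f"[{host}]"
    userinfo, at, _ = parts.netloc.rpartition("@")
    netloc = f"{userinfo}{at}{host}"
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc += f":{parts.port}"
    path = remove_dot_segments(parts.path or "/")
    query = parts.query
    if sort_query:  # opt-in only: some sites give query order meaning
        query = urlencode(sorted(parse_qsl(query, keep_blank_values=True)))
    return urlunsplit((scheme, netloc, path, query, ""))  # fragment dropped
```

Site-specific rules, such as dropping known tracking parameters, would layer on top of this per host rather than being applied globally.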
Pitfall: Claiming “exactly once” crawling.
In distributed networked systems, exactly-once fetch semantics are usually unrealistic and unnecessary. Aim for at-least-once execution with idempotent writes, deduplication, leases, and content hashing. Interviewers respond better to acknowledging duplicate fetches as possible and showing how you contain their cost.
Connections
Interviewers may pivot from this topic into distributed queues, rate limiting algorithms, consistent hashing, Bloom filters, cache eviction, or web security concerns such as SSRF prevention and malicious HTML parsing. They may also ask for a deeper dive on scheduler fairness, persistent state machines, or debugging a crawler whose throughput suddenly drops.
Further reading
- Mercator: A Scalable, Extensible Web Crawler — classic paper on crawler architecture, URL frontier management, and scaling concerns.
- The Anatomy of a Large-Scale Hypertextual Web Search Engine — includes early Google crawling and indexing design context.
- RFC 9309: Robots Exclusion Protocol — formal reference for robots.txt behavior and crawler access rules.
Practice questions

What's being tested
You’re being tested on low-level performance engineering: the ability to reason from source code down to compiler output, processor pipelines, memory hierarchy, and measurement methodology. The interviewer is probing whether you can improve a hot path without guessing: profile first, form hypotheses, change one variable, verify correctness, and quantify speedup. For Anthropic, this matters because software engineers often work near performance-critical infrastructure where small inefficiencies in kernels, serialization, scheduling, or memory movement can become expensive at scale. Strong answers show practical judgment: you know when to trust the compiler, when to guide it, when to rewrite code, and when an optimization is too clever to maintain.
Core knowledge
- Profiling before optimization is non-negotiable. Start with wall-clock time, CPU time, hardware counters, and flame graphs using tools like perf, VTune, Linux perf_events, pprof, or simulator traces. Optimize only a measured bottleneck, not code that “looks slow.”
- Speedup math should be explicit. Use Amdahl’s Law: if fraction $f$ of runtime is improved by factor $s$, total speedup is $S = \frac{1}{(1 - f) + f/s}$. A 10x improvement to 20% of runtime only gives $\frac{1}{0.8 + 0.02} \approx 1.22\times$ overall.
- Benchmark design must control noise. Pin threads with taskset, warm caches/JITs if relevant, disable frequency scaling where possible, run enough iterations, report median plus variance, and separate cold-start from steady-state behavior. For tiny kernels, use batches to avoid timer overhead dominating the result.
- Correctness verification needs equal rigor to speed measurement. Keep a scalar reference implementation, compare outputs bit-for-bit for integer code, use tolerances for floating point such as rtol/atol, and test edge cases: zero length, unaligned addresses, NaNs, overflow, negative values, and non-multiple vector widths.
- Compiler optimization control includes flags, pragmas, attributes, and source transformations. Know -O2, -O3, -march=native, -ffast-math, restrict, inline, noinline, #pragma unroll, #pragma clang loop vectorize(enable), and intrinsics such as AVX2/AVX-512. Each can improve codegen or silently change semantics.
- Assembly inspection is often the fastest way to validate assumptions. Use objdump, Compiler Explorer, llvm-mca, or perf annotate to check whether a loop vectorized, whether loads are hoisted, whether branches remain, and whether the compiler emitted expensive divisions, spills, or scalar fallback paths.
- Memory hierarchy usually dominates simple kernels. Reason about cache lines, spatial locality, temporal locality, prefetching, TLB misses, and bandwidth. A useful model is arithmetic intensity: operations per byte loaded. Low-intensity kernels are memory-bound; more ALU tricks will not help much.
- Branchless programming can reduce misprediction penalties but is not free. Replacing if with masks, cmov, bitwise operations, or table lookups helps when branches are unpredictable. If branches are highly predictable, branchless code may add instructions, increase register pressure, and perform worse.
- Bitwise tricks are useful when they clarify a machine-level operation: power-of-two tests via x & (x - 1) == 0 for nonzero x, alignment via (x + a - 1) & ~(a - 1), modulo by power of two via x & (n - 1), and sign masks via shifts. Watch for signed overflow and implementation-defined shifts.
- Instruction-level parallelism depends on dependency chains, latency, and throughput. A loop with a serial accumulator may bottleneck on add latency; multiple accumulators can expose parallelism. The goal is to keep execution ports busy without exceeding register capacity or causing spills.
- Pipeline hazards matter in scheduled architectures. Understand RAW read-after-write true dependencies, WAR write-after-read anti-dependencies, and WAW write-after-write output dependencies. VLIW machines expose scheduling to the compiler/programmer, so independent operations must be packed carefully into issue slots.
- Data layout transformations often beat instruction tricks. Switching from array-of-structs to struct-of-arrays, blocking/tiling for cache, aligning buffers, and eliminating pointer aliasing can unlock vectorization. But layout changes affect APIs, memory footprint, and maintainability, so justify them with measured impact.
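Two items from the list above, the Amdahl's Law arithmetic and the power-of-two bit tricks, are easy to sanity-check numerically. The sketch below uses Python integers, which do not overflow, so it shows the logic only and not the C/C++ overflow caveats; the function names are illustrative.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of runtime gets `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


def is_power_of_two(x):
    """True for 1, 2, 4, ...; clearing the lowest set bit must leave zero."""
    return x > 0 and (x & (x - 1)) == 0


def align_up(x, a):
    """Round x up to a multiple of a; a must be a power of two."""
    return (x + a - 1) & ~(a - 1)


def mod_pow2(x, n):
    """x % n for non-negative x when n is a power of two."""
    return x & (n - 1)


if __name__ == "__main__":
    print(round(amdahl_speedup(0.2, 10), 3))  # 10x on 20% of runtime
    print(round(amdahl_speedup(0.2, 1e12), 3))  # ceiling as the factor grows
    print(align_up(13, 8), mod_pow2(37, 8), is_power_of_two(48))
```

The second call shows the ceiling: even an unbounded speedup of a 20% region cannot exceed 1 / (1 - 0.2) = 1.25x overall, which is why the measured fraction matters more than the local speedup.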
Worked example
For Design a profiling plan for kernels, a strong candidate starts by clarifying the kernel’s purpose, input sizes, target hardware, correctness requirements, and whether the goal is latency, throughput, cost, or energy. They should declare assumptions such as: “I’ll treat this as a deterministic CPU kernel in C++, with a scalar reference and representative production-sized inputs.” The answer can then be organized around four pillars: establish a reliable benchmark, gather coarse-to-fine profiles, form microarchitectural hypotheses, and validate each optimization against correctness and performance regressions.
The benchmark pillar should include warmup, repeated trials, pinned CPU affinity, fixed compiler flags, representative data distributions, and reporting of median, p95, and variance rather than a single best run. The profiling pillar should start with wall-clock attribution, then move to counters like cycles, instructions, IPC, branch misses, cache misses, and memory bandwidth. The hypothesis pillar connects observations to causes: high branch-miss rate suggests branchless rewrite; low IPC with many cache misses suggests layout or blocking; high instruction count suggests strength reduction or vectorization. The validation pillar keeps a golden implementation and randomized/property tests so optimizations do not change semantics.
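The benchmark and validation pillars can be illustrated with a small harness. This is a sketch, not a substitute for a real benchmarking tool: the function names, the default trial counts, and the nearest-rank style p95 are assumptions, and CPU pinning and frequency control are left to the environment.

```python
import math
import statistics
import time


def benchmark(fn, *, warmup=3, trials=15, batch=1000):
    """Time fn in batches so timer overhead does not dominate tiny kernels."""
    for _ in range(warmup):
        fn()
    per_call = []
    for _ in range(trials):
        start = time.perf_counter()
        for _ in range(batch):
            fn()
        per_call.append((time.perf_counter() - start) / batch)
    per_call.sort()
    p95_index = min(len(per_call) - 1, int(0.95 * len(per_call)))
    return {
        "median_s": statistics.median(per_call),
        "p95_s": per_call[p95_index],
        "stdev_s": statistics.stdev(per_call) if len(per_call) > 1 else 0.0,
    }


def matches_reference(candidate, reference, inputs, rel_tol=1e-9, abs_tol=0.0):
    """Compare an optimized function against a golden implementation."""
    return all(
        math.isclose(candidate(x), reference(x), rel_tol=rel_tol, abs_tol=abs_tol)
        for x in inputs
    )
```

A typical loop is to run `matches_reference` first, then `benchmark` on both versions under identical conditions, and only then compare medians.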
A specific tradeoff to flag is using -ffast-math: it may enable vectorization and reassociation, but it can break IEEE behavior for NaNs, signed zero, infinities, and reproducibility. A good close is: “If I had more time, I’d inspect generated assembly, run the benchmark on a second CPU generation, and add a CI performance guardrail with a tolerance band to catch regressions.”
A second angle
For Schedule instructions on a VLIW pipeline, the same performance mindset applies, but the task shifts from measuring an opaque out-of-order CPU to explicitly arranging operations for a statically scheduled machine. Instead of asking “why is the CPU stalling?” you ask “which issue slots are unused, and which dependencies prevent filling them?” The candidate should identify RAW/WAR/WAW hazards, operation latencies, functional-unit constraints, and register pressure before proposing a schedule. The same tradeoff appears in a different form: unrolling or software pipelining can improve throughput, but it increases live values and may cause register spills. A strong answer explains both the optimized schedule and how they would validate it using a simulator trace or cycle count.
Common pitfalls
Pitfall: Treating optimization as a bag of tricks instead of an experimental process.
A tempting weak answer is “use SIMD, unroll loops, make it branchless.” That misses the core skill. A better answer says what metric would indicate each intervention, what downside it carries, and how you would prove the change helped.
Pitfall: Ignoring compiler and language semantics.
Low-level changes often cross semantic boundaries: signed integer overflow in C++ is undefined behavior, -ffast-math can alter floating-point results, and pointer aliasing can prevent vectorization unless restrict or layout changes are valid. Interviewers like to test whether you can optimize without making the program subtly wrong.
Pitfall: Over-indexing on microarchitecture while under-communicating the plan.
It is good to mention cache lines, ports, or pipeline hazards, but not as disconnected trivia. Structure the answer around a clear workflow: baseline, profile, diagnose, change, verify, measure again. That makes depth legible to the interviewer.
Connections
The interviewer may pivot from here into systems performance debugging, concurrency and lock contention, memory allocator behavior, or distributed-system tail latency. They may also connect kernel optimization to compiler design, CPU architecture, or GPU-style throughput programming, but for a software engineer the expected focus remains measurement, correctness, and practical tradeoffs.
Further reading
- Computer Architecture: A Quantitative Approach, Hennessy and Patterson — canonical treatment of pipelines, memory hierarchy, ILP, and performance models.
- What Every Programmer Should Know About Memory, Ulrich Drepper — detailed explanation of caches, TLBs, prefetching, and memory-access costs.
- Agner Fog Optimization Manuals — practical references for instruction latency/throughput, calling conventions, vectorization, and assembly-level performance.
Practice questions
ML System Design

ML Inference APIs And GPU Batching
What's being tested
These interviews test whether you can design a GPU-backed inference service that meets real latency, throughput, reliability, and cost constraints under multi-tenant load. The interviewer is probing for distributed systems judgment: queueing, batching, routing, autoscaling, failure isolation, observability, and API semantics. For Anthropic, this matters because inference infrastructure sits directly on the product path: poor batching wastes expensive accelerators, poor isolation hurts customers, and poor overload behavior can take down shared capacity. A strong Software Engineer answer should stay at the serving-platform layer, not drift into model architecture or training methodology.
Core knowledge
- Latency SLOs must be decomposed across the request path: client edge, auth/rate limit, routing, queue wait, GPU execution, post-processing, and streaming. Track p50, p95, p99, timeout rate, and queue wait separately; an aggregate p99 hides whether the bottleneck is scheduling or compute.
- Dynamic batching groups requests arriving within a short window to improve GPU utilization. The key knobs are max batch size, max batch delay, token budget, and compatible model/version. Larger batches increase throughput but add queueing delay, so the policy must be tied to an SLO like “p95 first-token latency < 500 ms.”
- Queueing theory gives the basic danger signal: as utilization approaches 1, queueing latency grows nonlinearly. In practice, keep serving pools below roughly 60–80% sustained utilization if p99 latency matters, because burstiness and long prompts create tail amplification.
- Continuous batching for autoregressive LLM serving differs from simple request batching. New requests can join while existing requests are decoding, and finished sequences leave the batch. Systems like vLLM use paged attention to reduce KV-cache fragmentation and improve memory utilization during long-running generation.
- Prefill vs decode are different workload phases. Prefill processes the input prompt in parallel and is compute-heavy; decode generates one or a few tokens per step and is often memory-bandwidth or scheduling-sensitive. A good design may route, batch, and measure these phases separately.
- Admission control protects the service before it collapses. Use per-tenant rate limits, max in-flight requests, max prompt length, max output tokens, and queue deadlines. If estimated work exceeds capacity, return 429 or 503 early rather than accepting work that will time out in the queue.
- Routing should consider model ID, model version, tenant tier, region, hardware type, current queue depth, GPU memory availability, and request shape. A basic design uses a control plane for fleet state and a data-plane router using least-loaded or weighted routing with health checks.
- Multi-tenancy isolation requires fairness, not just authentication. Common approaches include per-tenant queues, weighted fair queueing, token-bucket rate limits, reserved capacity for high-priority tenants, and noisy-neighbor detection. Without this, one customer with long prompts can consume KV cache and degrade everyone else’s p99.
- GPU memory management is often the hard limit. Model weights, activation buffers, and KV cache compete for memory. For LLMs, KV cache grows roughly with batch size × sequence length × layers × hidden dimension, so “batch more” can trigger out-of-memory failures unless capped by a token budget.
- Failure handling should distinguish retryable and non-retryable failures. Router or worker crashes can be retried if the request is idempotent; partial streamed responses usually cannot be transparently retried without client-visible semantics. Use deadlines, cancellation propagation, circuit breakers, and draining for deploys.
- Autoscaling should use workload-aware signals, not just CPU. Better signals include queue depth by model, queue age, GPU utilization, tokens/sec, batch fullness, KV-cache pressure, and SLO burn rate. Scale-up is slow for GPU nodes, so keep warm capacity or predictive buffers for known traffic spikes.
- Observability needs cardinality discipline and request-shape breakdowns. Emit metrics for time_to_first_token, tokens_per_second, queue_wait_ms, batch_size, prompt_tokens, completion_tokens, OOM count, retry count, and per-tenant throttling. Logs and traces should include request IDs but avoid storing sensitive prompt text by default.
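The KV-cache scaling mentioned in the list above can be turned into a back-of-the-envelope token budget. The sketch assumes standard multi-head attention, where each layer stores one key and one value vector of `hidden_dim` elements per token; grouped-query or multi-query attention would shrink this. The model shape in the example (32 layers, hidden size 4096, 2-byte values) is hypothetical.

```python
def kv_cache_bytes(total_tokens, layers, hidden_dim, bytes_per_value=2):
    """Approximate KV-cache size: 2 (K and V) x layers x tokens x hidden x bytes."""
    return 2 * layers * total_tokens * hidden_dim * bytes_per_value


def max_tokens_in_flight(free_bytes, layers, hidden_dim, bytes_per_value=2):
    """How many tokens (summed across the batch) fit in the memory left for KV cache."""
    return free_bytes // kv_cache_bytes(1, layers, hidden_dim, bytes_per_value)


if __name__ == "__main__":
    per_token = kv_cache_bytes(1, layers=32, hidden_dim=4096)
    budget = max_tokens_in_flight(16 * 2**30, layers=32, hidden_dim=4096)
    print(f"{per_token / 2**20:.2f} MiB per token, budget {budget} tokens")
```

With those assumed numbers each token costs 0.5 MiB, so 16 GiB of free memory supports 32,768 tokens in flight. That is the kind of figure a scheduler's token budget should be derived from, rather than a fixed request count.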
Worked example
For “Design GPU inference request batching,” start by clarifying the workload: “Are these LLM text generation requests, embeddings, or classification? Do we optimize for time-to-first-token, total completion latency, throughput, or cost? Are requests streamed, and do tenants have different SLOs?” Then declare assumptions: a shared fleet serves multiple model versions, requests have variable prompt and output lengths, and the service has a strict p95 latency target.
Organize the answer around four pillars: request ingress and validation, batching scheduler, GPU worker execution, and observability/autoscaling. At ingress, describe auth, tenant rate limits, request deadlines, token limits, and routing by model/version. In the scheduler, explain per-model queues, compatibility constraints, max batch delay, max batch size, and token-budget-based batching rather than only count-based batching. In the worker, mention loading model weights, managing KV cache, streaming partial outputs, handling cancellation, and returning structured errors.
A strong tradeoff to flag is throughput versus tail latency: waiting 20 ms to build a fuller batch may improve GPU utilization materially, but doing so for an interactive tenant can violate first-token SLOs. You can propose separate classes, such as “interactive” with small max delay and “batch/offline” with larger delay and lower priority. Close by saying that, with more time, you would detail load testing methodology, failure injection, and cost controls such as model placement and warm-pool sizing.
A second angle
For “Design a prompt processing backend,” the same serving concepts apply, but the emphasis shifts toward asynchronous job orchestration and durable state. Instead of optimizing only for interactive latency, you may need an API that accepts a job, returns a job ID, supports idempotent submission, and lets clients poll or receive callbacks. Batching still matters at the GPU layer, but the frontend design also needs job state transitions such as queued, running, succeeded, failed, and cancelled. Retries and dead-letter handling become more central because a background job can survive client disconnects, unlike a purely synchronous inference call.
Common pitfalls
Pitfall: Designing batching as “collect N requests, run them, repeat.”
That answer misses variable sequence lengths, deadlines, tenant priority, and GPU memory limits. A better answer says batches are formed by compatibility and token budget, constrained by max wait time and per-request deadlines.
Pitfall: Talking only about GPU utilization and ignoring user-visible latency.
High utilization is not the product goal; it is a cost-efficiency goal under an SLO. Interviewers expect you to reason about p95/p99, queue wait, time-to-first-token, overload behavior, and what happens when the system is near saturation.
Pitfall: Hand-waving multi-tenancy as “add rate limiting.”
Rate limits are necessary but insufficient. You should also discuss per-tenant queues, weighted fairness, reserved capacity, priority classes, request size limits, and metrics that prove one tenant cannot degrade another tenant’s latency.
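One way to make "per-tenant queues with weighted fairness" concrete is a weighted round-robin dequeue. This is a simplified sketch, assuming each tenant has its own FIFO queue and an integer weight; the function name and parameters are hypothetical, and weights below 1 are treated as 1 so every backlogged tenant makes progress.

```python
from collections import deque
from typing import Deque, Dict, List


def weighted_round_robin(queues: Dict[str, Deque[str]],
                         weights: Dict[str, int],
                         capacity: int) -> List[str]:
    """Pick up to `capacity` requests, visiting tenants in rounds.

    In each round a tenant contributes at most its weight in requests, so a
    tenant with a deep backlog cannot consume the whole batch.
    """
    picked: List[str] = []
    while len(picked) < capacity and any(queues.values()):
        for tenant, tenant_queue in queues.items():
            share = max(1, weights.get(tenant, 1))
            for _ in range(share):
                if not tenant_queue or len(picked) == capacity:
                    break
                picked.append(tenant_queue.popleft())
    return picked
```

With equal weights, a tenant that has six queued requests and a tenant that has two each get two slots in a batch of four, which is the isolation property the metrics should then confirm.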
Connections
The interviewer may pivot from inference APIs into load balancing, distributed rate limiting, autoscaling, streaming API design, or idempotent job processing. They may also probe adjacent ML-serving concepts such as model rollout, canarying, shadow traffic, and per-version observability, but a Software Engineer should frame these as platform reliability and deployment concerns.
Further reading
-
The Tail at Scale: foundational paper on why p99 latency dominates large-scale user-facing systems.
-
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention — useful for understanding continuous batching and KV-cache memory pressure in LLM serving.
-
SRE Book: Handling Overload — practical patterns for admission control, load shedding, and graceful degradation.
Practice questions
Machine Learning

What's being tested
This area tests whether you can implement and reason about machine learning primitives at the level needed to build reliable model-facing software: gradients, log-probabilities, tensor shapes, masking, numerical stability, and training-loop invariants. For a Software Engineer, the bar is not inventing new architectures; it is being able to read, modify, debug, and productionize ML code without silently changing model behavior. Anthropic cares because small engineering mistakes in attention masks, log-prob aggregation, or policy-gradient ratios can create large downstream failures: broken evaluations, unstable training, incorrect classifier scores, or wasted accelerator time. Interviewers are probing for a combination of mathematical literacy, implementation discipline, and debugging judgment.
Core knowledge
-
Backpropagation is repeated application of the chain rule over a computation graph. For parameter $\theta$, compute $\partial L / \partial \theta$ by caching forward intermediates, propagating upstream gradients backward, and matching every gradient tensor’s shape to its corresponding activation or parameter.
-
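Binary cross-entropy should be implemented from logits, not probabilities, for stability. Prefer BCEWithLogitsLoss-style math: $\ell(x, y) = \max(x, 0) - x y + \log\left(1 + e^{-|x|}\right)$ for logit $x$ and label $y \in \{0, 1\}$. This avoids overflow and log(0) from sigmoid(x) when $|x|$ is large.
-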
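Gradient checking compares analytic gradients to finite differences: $\frac{\partial L}{\partial \theta_i} \approx \frac{L(\theta + \epsilon e_i) - L(\theta - \epsilon e_i)}{2\epsilon}$. Use float64, small networks, and a small step $\epsilon$. Expect relative error around 1e-6 to 1e-4; worse often means a broadcasting or reduction bug.
-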
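Scaled dot-product attention computes $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$, where $M$ encodes masks. Scaling by $1/\sqrt{d_k}$ prevents logits from growing with embedding dimension and saturating the softmax.
-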
Masking semantics are a common failure point. A causal mask prevents token $i$ from attending to future positions $j > i$; a padding mask prevents attending to non-real tokens. In PyTorch, masks must broadcast correctly across batch and head dimensions.
-
Numerically stable softmax subtracts the row-wise maximum before exponentiation: softmax(x) = exp(x - max(x)) / sum(exp(x - max(x))). In mixed precision, use float32 accumulation for attention scores when possible, especially before softmax.
-
Log-probabilities should be aggregated in log space. For binary classification using an LLM, compare candidate label sequences using sums of token log-probs, and normalize if label strings have different lengths. Use log-sum-exp for stable probability normalization.
-
Calibration separates ranking from probability quality. An LLM may assign higher log-prob to the correct class but produce poorly calibrated probabilities. Simple post-hoc methods such as temperature scaling can improve probability estimates without changing the underlying rank order.
-
On-policy reinforcement learning loops require log-probs from the same policy distribution that generated the sampled responses, or carefully bounded importance sampling. In PPO/GRPO-style code, stale samples, recomputed masks, or mismatched tokenization can corrupt ratios.
-
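Importance-sampling ratios usually take the form $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} = \exp\left(\log \pi_\theta(a_t \mid s_t) - \log \pi_{\theta_{\text{old}}}(a_t \mid s_t)\right)$. Compute in log space, clamp or clip only where the algorithm specifies, and inspect ratio histograms for explosions or collapse.
-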
Advantage normalization improves training stability but must respect grouping and masking semantics. In GRPO-style setups, advantages may be normalized within a group of completions for the same prompt; including padding tokens or mixing unrelated prompts changes the learning signal.
-
Testing ML code requires shape tests, numerical tests, and invariance tests. For attention, test all-masked rows, single-token sequences, causal behavior, padding behavior, and parity with torch.nn.functional.scaled_dot_product_attention where applicable.
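The stable binary cross-entropy and gradient-checking items above can be combined into one small check. This is a minimal scalar sketch in pure Python rather than a tensor implementation; the function names and the step size `eps=1e-5` are assumptions chosen for illustration.

```python
import math


def bce_with_logits(x: float, y: float) -> float:
    """Stable binary cross-entropy for logit x and label y in {0, 1}."""
    # max(x, 0) - x*y + log(1 + exp(-|x|)) never exponentiates a large positive number.
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))


def bce_grad(x: float, y: float) -> float:
    """Analytic gradient d(loss)/dx = sigmoid(x) - y, with a stable sigmoid."""
    if x >= 0:
        s = 1.0 / (1.0 + math.exp(-x))
    else:
        e = math.exp(x)
        s = e / (1.0 + e)
    return s - y


def numeric_grad(f, x: float, eps: float = 1e-5) -> float:
    """Central finite difference; eps is an assumed step size."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


def relative_error(a: float, b: float) -> float:
    return abs(a - b) / max(abs(a), abs(b), 1e-12)


if __name__ == "__main__":
    for x in (-3.0, 0.5, 4.0):
        for y in (0.0, 1.0):
            approx = numeric_grad(lambda v: bce_with_logits(v, y), x)
            print(x, y, relative_error(bce_grad(x, y), approx))
```

A naive `-log(1 - sigmoid(x))` fails for a logit such as 1000 because the sigmoid rounds to exactly 1, while the stable form returns the finite loss.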
Worked example
For “Implement and analyze custom attention,” a strong candidate starts by clarifying tensor layout: “I’ll assume Q, K, and V have shape [batch, heads, seq_len, head_dim], and I’ll return [batch, heads, query_len, head_dim].” They should ask whether the mask is causal, padding, or both, and whether the implementation must support mixed precision. The answer can be organized into four steps: compute scores with Q @ K.transpose(-2, -1), divide by sqrt(head_dim), apply masks before softmax, then multiply by V. They should explicitly say that masks should use a large negative additive value or boolean masking, not multiplication by zero, because a logit of zero still receives nonzero probability after softmax. A good implementation discussion includes shape broadcasting, e.g. converting a padding mask from [batch, key_len] to [batch, 1, 1, key_len]. The candidate should flag numerical stability: torch.softmax subtracts the row maximum internally, but -inf masks need care in half precision. One tradeoff to mention is readability versus performance: a clean tensorized implementation is acceptable for an interview, while production might use fused kernels such as FlashAttention or scaled_dot_product_attention. A strong close would be: “If I had more time, I’d add tests comparing against a reference implementation across causal masks, padding masks, all-padding edge cases, and float16/bfloat16 behavior.”
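The masking semantics in this walkthrough can be checked against a tiny reference implementation. The sketch below is a single-head, pure-Python version on nested lists, not the batched tensor code an interview would expect. The names `attention` and `key_is_padding` are hypothetical, the causal mask assumes query and key positions are aligned, and returning zeros for a fully masked row is a convention chosen here, not a library guarantee.

```python
import math
from typing import List, Optional

Matrix = List[List[float]]


def softmax_row(scores: List[float]) -> List[float]:
    """Softmax with max subtraction; a fully masked row returns all zeros."""
    row_max = max(scores)
    if row_max == -math.inf:
        return [0.0] * len(scores)  # convention chosen for this sketch
    exps = [math.exp(s - row_max) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix, causal: bool = False,
              key_is_padding: Optional[List[bool]] = None) -> Matrix:
    """Single-head scaled dot-product attention.

    q: [query_len][head_dim], k: [key_len][head_dim], v: [key_len][value_dim].
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    value_dim = len(v[0])
    out: Matrix = []
    for i, q_i in enumerate(q):
        scores = []
        for j, k_j in enumerate(k):
            is_future = causal and j > i
            is_pad = key_is_padding is not None and key_is_padding[j]
            if is_future or is_pad:
                scores.append(-math.inf)  # mask the logit before softmax
            else:
                scores.append(scale * sum(a * b for a, b in zip(q_i, k_j)))
        weights = softmax_row(scores)
        out.append([sum(w * v_j[d] for w, v_j in zip(weights, v))
                    for d in range(value_dim)])
    return out
```

In a tensor library the same two masks would instead be broadcast across batch and head dimensions, but the expected outputs for causal, padding, and all-padding cases should match this reference.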
A second angle
For “Debug a GRPO training loop and explain ratios”, the same fundamentals show up as log-prob accounting rather than attention math. The core invariant is that token-level log-probs, masks, rewards, and advantages must align over exactly the generated response tokens, not prompt tokens or padding. Instead of asking “does the tensor multiply work,” the interviewer is testing whether you can identify silent training bugs: recomputing old_logprobs with the new model, normalizing advantages across the wrong group, or including masked tokens in the loss. The importance ratio is mathematically simple but operationally fragile because tokenization, batching, and masking must be identical. The best answers combine formulas with concrete debug checks: print shapes, assert mask sums, inspect ratio distributions, and run a tiny deterministic batch.
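The alignment invariant described here can be made concrete with a small pure-Python sketch. It shows only group-normalized advantages and a PPO-style clipped term averaged over response tokens; it leaves out the KL penalty and other terms a full GRPO loss would carry. The function names, the `clip_eps=0.2` default, and the 0/1 `mask` convention are assumptions for illustration.

```python
import math
from typing import List


def group_normalized_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one prompt's group of completions."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def clipped_surrogate(new_logprobs: List[float], old_logprobs: List[float],
                      mask: List[int], advantage: float,
                      clip_eps: float = 0.2) -> float:
    """Mean clipped objective over response tokens of one completion.

    mask[t] is 1 for generated response tokens and 0 for prompt or padding.
    """
    assert len(new_logprobs) == len(old_logprobs) == len(mask)
    n_tokens = sum(mask)
    assert n_tokens > 0, "completion has no response tokens"
    total = 0.0
    for new_lp, old_lp, m in zip(new_logprobs, old_logprobs, mask):
        if not m:
            continue  # masked positions must not contribute to the loss
        ratio = math.exp(new_lp - old_lp)  # computed from log-probs
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * advantage, clipped * advantage)
    return total / n_tokens
```

One debug check that follows from this: if every ratio is still exactly 1 after the policy has been updated, `old_logprobs` was probably recomputed with the new model instead of being stored at sampling time.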
Common pitfalls
Pitfall: Treating ML primitives as black-box library calls.
Saying “I’d just use torch.autograd” or “I’d call nn.MultiheadAttention” misses the point. It is fine to use libraries in production, but in the interview you need to show that you understand the derivative, masking, or log-prob calculation well enough to catch bad outputs.
Pitfall: Ignoring numerical stability.
Wrong-but-tempting answers include computing log(sigmoid(x)) directly, applying softmax before masking, or exponentiating raw log-prob differences without checking range. Better answers keep values in logit or log-prob space, use logsumexp, subtract maxima before normalization, and inspect for NaN/inf.
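The advice to stay in log-prob space and use logsumexp can be illustrated with the label-scoring case from the core knowledge list. This is a minimal sketch: `label_probabilities` and its `length_normalize` flag are hypothetical names, and the per-token log-probs are assumed to come from scoring each candidate label string with the model.

```python
import math
from typing import Dict, List


def logsumexp(values: List[float]) -> float:
    """Stable log(sum(exp(v))) via max subtraction."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))


def label_probabilities(token_logprobs: Dict[str, List[float]],
                        length_normalize: bool = False) -> Dict[str, float]:
    """Turn per-token log-probs for each candidate label into probabilities.

    With length_normalize=True, each label is scored by its mean token
    log-prob so that longer label strings are not penalized for length alone.
    """
    scores = {}
    for label, logprobs in token_logprobs.items():
        total = sum(logprobs)
        scores[label] = total / len(logprobs) if length_normalize else total
    log_z = logsumexp(list(scores.values()))
    return {label: math.exp(score - log_z) for label, score in scores.items()}
```

With sequence log-probs near -1000, exponentiating each score directly underflows to zero and the normalization becomes 0/0, while the log-space version still returns a valid distribution.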
Pitfall: Communicating only equations or only code.
A purely mathematical derivation can miss software failure modes like broadcasting bugs, dtype mismatches, and off-by-one causal masks. A purely code-level answer can miss why the code is correct. The strongest answers alternate between invariant, formula, tensor shape, and test.
Connections
Interviewers may pivot from here into model serving, especially batching, latency, and memory tradeoffs for LLM inference. They may also connect to distributed training, including gradient accumulation, mixed precision, and checkpointing, or to evaluation infrastructure, where classifier calibration and log-prob scoring become product-facing reliability concerns.
Further reading
-
Deep Learning by Goodfellow, Bengio, and Courville — strong reference for backpropagation, optimization, and numerical foundations.
-
Attention Is All You Need — original Transformer paper; useful for understanding scaled dot-product attention and masking.
-
Proximal Policy Optimization Algorithms — canonical source for PPO-style ratios, clipping, and policy-gradient stability.
Practice questions
Behavioral & Leadership
What's being tested
Interviewers are probing mission-aligned engineering judgment: whether you can build useful systems while recognizing safety, reliability, and misuse risks that come with deploying advanced AI. For a Software Engineer, this is not about inventing alignment theory or choosing model architectures; it is about how you make technical tradeoffs, escalate uncertainty, design safe defaults, and communicate clearly across engineering, research, security, policy, and product partners. Strong answers show ownership under ambiguity: you can identify risk, reduce it with concrete engineering controls, and still deliver pragmatic progress. Anthropic cares because seemingly ordinary SWE decisions — logging, rate limits, rollout gates, access controls, eval harnesses, incident response, dependency choices — can materially affect whether powerful AI systems behave safely in production.
Core knowledge
-
Risk framing should be concrete: define the harm, affected users, likelihood, blast radius, detection path, and reversibility. A useful shorthand treats risk as the product of factors such as likelihood and blast radius; reduce one factor through controls like canaries, access limits, monitoring, or rollback.
-
Defense in depth matters more than any single safety mechanism. For AI-facing systems, layer input validation, policy checks, model/tool permissioning, output filtering, audit logs, abuse detection, human review for high-risk flows, and emergency kill switches rather than relying on one classifier or prompt rule.
-
Safe rollout practices translate values into engineering operations. Mention feature flags, staged rollouts, shadow mode, allowlists, p95/p99 latency monitoring, error-budget gates, abuse-rate dashboards, rollback playbooks, and post-launch review. The key judgment is knowing when slower rollout beats faster shipping.
-
Reliability is safety-relevant when AI systems make high-impact recommendations or take tool actions. A timeout, stale cache, partial failure, or retry storm can become a user-facing safety issue. Discuss idempotency keys, circuit breakers, bounded retries with exponential backoff, graceful degradation, and clear user-visible failure states.
-
Access control and least privilege are central when models can access tools, user data, or internal systems. Use scoped service tokens, short-lived credentials, audit trails, separation between read/write permissions, and explicit authorization checks before tool execution. Avoid “temporary” broad admin access that becomes permanent.
-
Observability should detect both conventional failures and safety failures. In addition to 5xx rates, p99 latency, queue depth, and saturation, track policy-trigger rates, tool-denial rates, anomalous request patterns, prompt-injection attempts, escalation volume, and manual review outcomes. Logs must avoid storing sensitive user data unnecessarily.
-
Incident ownership requires accountability without defensiveness. A strong SWE describes timeline, customer impact, root cause, what they personally owned, mitigations shipped, and what changed afterward. Use blameless postmortems, but do not hide behind “the system failed”; identify the engineering decision you would revisit.
-
Cross-functional conflict should be resolved by making assumptions explicit. If research wants broader evaluation, product wants launch speed, and infrastructure worries about reliability, translate disagreement into risks, options, owners, deadlines, and decision criteria. Good leadership is structured escalation, not consensus theater.
-
Ethical judgment is strongest when tied to implementation details. Instead of saying “I care about safety,” explain how you would handle a model behavior that enables abuse: reproduce it, assess severity, gate the feature, notify responsible stakeholders, add tests/evals, and document a launch decision.
-
Impact storytelling needs a clear technical spine. For “most impactful project,” cover system context, constraints, your contribution, architecture or code decisions, measurable outcome, and lessons learned. Metrics can include latency reduction, availability, developer velocity, cost savings, incident reduction, or safer launch posture.
-
Ambiguity management is a core leadership signal. When requirements are underspecified, state assumptions, identify irreversible decisions, create a small prototype or design doc, seek review from domain owners, and define a stopping rule. The interviewer wants to see calibrated confidence, not heroic certainty.
-
Values alignment should sound earned, not rehearsed. Connect your motivation to concrete behaviors: careful code review, willingness to slow down a risky launch, mentorship that raises engineering standards, and curiosity about safety constraints. Avoid claiming expertise in alignment research if your contribution is engineering execution.
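Among the controls listed above, "bounded retries with exponential backoff" is easy to describe loosely and get wrong in code, so a short sketch may help when telling such a story. Everything here is illustrative: the function name, the defaults, the choice of `TimeoutError` as the only retryable failure, and the injectable `sleep` parameter are assumptions, and the jitter strategy shown is one common option and not the only one.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(operation: Callable[[], T],
                      max_attempts: int = 4,
                      base_delay_s: float = 0.1,
                      max_delay_s: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation`, retrying timeouts a bounded number of times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # surface a clear failure instead of retrying forever
            # The delay cap doubles each attempt; random jitter spreads out
            # clients so they do not retry in lockstep (a retry storm).
            cap = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(random.uniform(0.0, cap))
    raise AssertionError("unreachable")
```

Retrying is only safe when the operation is idempotent, which is why the same bullet pairs retries with idempotency keys.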
Worked example
For “Answer general fit and AI safety questions,” a strong candidate should frame the first 30 seconds by saying: “I’ll answer from the perspective of an engineer building and operating systems around models, not as a researcher designing the model itself.” Then clarify the risk surface: is the system user-facing, does it call external tools, does it access private data, and what is the worst plausible misuse or failure mode? The answer can be organized around four pillars: motivation for Anthropic’s mission, a concrete example of responsible technical judgment, how you collaborate under uncertainty, and how you balance shipping with safety.
A strong skeleton might be: “In a prior project, I owned a service that exposed automated actions to users. The risk was not just uptime; an incorrect action could affect user trust. I added scoped permissions, staged rollout, structured logging, and an emergency disable path before broad release.” The tradeoff to flag explicitly is speed versus reversibility: you may accept a narrower beta and slower adoption if it gives better monitoring and rollback capability. You should avoid sounding like every risk requires a months-long process; instead, show proportionality by severity. Close with something like: “If I had more time, I’d invest in a repeatable pre-launch checklist and regression tests for known safety failures, so the team doesn’t rely on individual memory.”
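If the interviewer asks what "staged rollout" and "emergency disable path" looked like in code, a small gate function is usually enough to show the idea. The sketch below is hypothetical throughout: the function names, the hash-based bucketing, and the rule that the kill switch overrides the allowlist are design assumptions for illustration, not a description of any particular flag system.

```python
import hashlib
from typing import FrozenSet


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in [0, 100) for one feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id: str, feature: str, rollout_percent: int,
               kill_switch_on: bool,
               allowlist: FrozenSet[str] = frozenset()) -> bool:
    """Decide whether `feature` is on for `user_id`."""
    if kill_switch_on:
        return False  # emergency disable wins over every other rule
    if user_id in allowlist:
        return True   # narrow beta users see the feature before rollout
    return rollout_bucket(user_id, feature) < rollout_percent
```

Because the bucket is deterministic, raising `rollout_percent` only adds users and never removes anyone already enabled, which keeps monitoring comparisons and rollback straightforward.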
A second angle
For “Describe failure impact and resolve cross-functional conflict,” the same concept shifts from proactive judgment to recovery and influence. Here the interviewer wants to know whether you can own a bad outcome without becoming defensive, especially when other teams contributed to the failure. Frame the situation around impact first: users affected, duration, severity, data or trust implications, and what was done immediately to stop the bleeding. Then describe how you separated facts from blame, used logs or traces to establish the timeline, and aligned stakeholders on fixes. The safety-aligned answer is not “I convinced everyone I was right”; it is “I created a shared model of risk, got the right decision made, and changed the system so the same failure was less likely.”
Common pitfalls
Pitfall: Giving a values-only answer with no engineering mechanism.
Saying “AI safety is important and I would escalate concerns” is too generic. A stronger answer names the mechanism: feature flag, access control, audit log, eval gate, rollback plan, abuse dashboard, postmortem action item, or explicit launch criterion.
Pitfall: Treating safety as someone else’s job.
It is fair to say you would consult researchers, security, legal, or policy experts, but weak answers outsource all judgment. As a Software Engineer, you still own the quality of the system boundary: permissions, failure modes, observability, testing, deployment, and operational response.
Pitfall: Over-indexing on perfection and blocking all progress.
Anthropic values careful deployment, but leadership judgment includes proportionality. A better answer distinguishes low-risk reversible changes from high-risk irreversible ones, proposes staged exposure, and defines evidence needed to proceed rather than saying “I would not launch until everything is perfectly safe.”
Connections
Interviewers may pivot from this topic into system design for reliable AI products, incident response, security and privacy engineering, or cross-functional leadership. Be ready to connect behavioral examples to concrete design choices like rate limiting, authorization, monitoring, rollback, and data handling.
Further reading
-
Concrete Problems in AI Safety — Classic framing of practical accident risks such as negative side effects, reward hacking, robustness, and safe exploration.
-
Site Reliability Engineering — Useful operational vocabulary for reliability, incident response, error budgets, and production ownership.
-
NIST AI Risk Management Framework — Practical language for identifying, measuring, managing, and governing AI-related risk.
Practice questions