Harvey Software Engineer Interview Prep Guide
Everything Harvey actually asks Software Engineer candidates — concept walkthroughs, worked examples, and the real interview questions, drawn from candidate reports. Free to read.
Last updated

Technical Screen
Coding & Algorithms

What's being tested
This tests hierarchical data modeling with a trie/tree of directories and files, plus robust path parsing and name-collision handling. Interviewers are looking for clean object design, edge-case discipline, and ability to reason about `mkdir`, `ls`, `addContentToFile`, `readContentFromFile`, capacity limits, and duplicate names.
Patterns & templates
-
Trie/tree node model — each
`Node`stores`children: Map[str, Node]`,`isFile`, optional`content`, metadata, and size; operations areO(path components). -
Path normalization — split on
/, ignore empty components, handle root/; avoid bugs from trailing slashes like/a/b/. -
Directory traversal helper — implement
`getNode(path, create=False)`to centralize validation, auto-creation, and missing-path behavior. -
Lexicographic listing —
`ls(path)`returns[filename]for files or sorted child names for directories; sorting costsO(k log k). -
Content accounting — for constrained systems, track total bytes and per-file size deltas before appending; reject writes that exceed capacity.
-
Duplicate naming policy — OS-style collisions often require generating
`name(1)`,`name(2)`; maintain counters or scan siblings carefully. -
Complexity explanation — use
d = path depth,k = directory children,c = content length; traversalO(d), append/read affected byc.
Common pitfalls
Pitfall: Treating files and directories as separate maps makes traversal and collision handling harder than a unified
`Node`abstraction.
Pitfall: Forgetting that
`ls("/a/file.txt")`should usually return only["file.txt"], not file content or children.
Pitfall: Capacity checks must account for append delta, not total rewritten content unless the API replaces file contents.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions
-
Spreadsheet Formula Engine — covered in depth under Onsite below.
-
Expression Parsing — covered in depth under Onsite below.
-
Dependency Graphs — covered in depth under Onsite below.

What's being tested
Tests exact substring matching across one or more source strings, then converting raw match positions into correctly tagged output. Interviewers are probing for clean interval generation, overlap/adjacency merging, deterministic handling of multiple sources, and bug-free string reconstruction.
Patterns & templates
-
Brute-force matching with
`find_all(text, pattern)`— loop using`str.find(pattern, start)`; advance by 1 to allow overlapping matches. -
Interval collection — convert every match to half-open
[start, end)intervals; half-open bounds simplify merging and output slicing. -
Sort-and-merge intervals — sort by
(start, end), then merge whennext.start <= cur.end; use<if adjacent matches should stay separate. -
Multi-pattern search — for many source strings, use Aho-Corasick for
`O(n + total_matches + total_pattern_length)`instead of nested scans. -
Output reconstruction — append unmatched text, opening tag, matched slice, closing tag; never mutate the string while iterating indices.
-
Counting occurrences — keep a map like
`source_id -> count`or`pattern -> count`; define whether overlapping occurrences count before coding. -
Edge-case template — handle empty sources, duplicate patterns, empty pattern strings, case sensitivity, punctuation, Unicode, and repeated adjacent matches.
Common pitfalls
Pitfall: Advancing
`start += len(pattern)`misses overlapping matches like finding"aa"twice in"aaa".
Pitfall: Inserting tags directly into the original string shifts later indices and corrupts positions; build a new output buffer instead.
Pitfall: Forgetting to merge adjacent intervals can produce
`<b>abc</b><b>def</b>`when the expected output is`<b>abcdef</b>`.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions
System Design

What's being tested
This tests whether you can design a production-grade distributed storage system with clean API boundaries, durable object storage, strongly modeled metadata, and well-scoped consistency guarantees. The interviewer is probing for how you separate file bytes from metadata, handle concurrency for operations like rename and move, support large resumable uploads, and reason about failures without hand-waving. Harvey cares because legal and enterprise workflows depend on correctness around documents: losing a file, exposing the wrong version, or corrupting folder state is unacceptable. A strong Software Engineer answer should show practical distributed-systems judgment: which components need strong consistency, which can be eventually consistent, and where to use transactions, idempotency, and background repair.
Core knowledge
-
Storage/metadata separation is the starting architecture: store file contents as immutable blobs in
`S3`,`GCS`, or an internal object store, and store directory trees, permissions, versions, and upload state in a transactional metadata service like`Postgres`,`MySQL`,`Spanner`, or`DynamoDB`. -
Object keys should not be user-visible paths. Use stable IDs such as
`file_id`,`version_id`, and`blob_id`;/client/matter/doc.pdfis metadata. This prevents rename from requiring object rewrites and lets the same blob be referenced by multiple versions or deduplicated. -
Directory modeling usually uses either an adjacency list table, with
`parent_id`and`name`, or a materialized path / closure table for faster subtree queries. Adjacency lists are simple and transactional; materialized paths make subtree listing easier but complicate renames and moves. -
Uniqueness constraints are essential for correctness. A typical table has
UNIQUE(parent_id, name)to prevent two entries with the same name in one folder. For case-insensitive filesystems, normalize names into`normalized_name`and enforceUNIQUE(parent_id, normalized_name). -
Atomic rename/move should be a metadata transaction, not a blob operation. Update
`parent_id`,`name`,`updated_at`, and possibly path cache inside one database transaction. Validate permissions, destination existence, name uniqueness, and cycle prevention before commit. -
Per-directory limits need concurrency-safe enforcement. If a folder has max
Nchildren, do not readCOUNT(*)and then insert without protection. Use a locked counter row,SELECT ... FOR UPDATE, serializable isolation, or a conditional update likeUPDATE folders SET child_count = child_count + 1 WHERE id = ? AND child_count < N. -
Transaction boundaries should keep external object storage out of database transactions. Common flow: create upload session in metadata, upload bytes to object storage, verify checksum, then commit a file version referencing the blob. Use a garbage collector for orphaned uploaded blobs.
-
Idempotency keys protect create/upload/commit APIs from client retries. Store
(user_id, operation, idempotency_key) -> responsewith a TTL or durable record. Stripe’s idempotency-key pattern is useful: the same key and parameters return the same result; parameter mismatch returns an error. -
Large-file uploads should support multipart upload and resumability. Split files into chunks, track uploaded parts with part numbers, byte ranges,
ETags, and checksums. Finalize only after all required parts are present and the full-file checksum matches, e.g.SHA-256(file) == expected_hash. -
Versioning is usually append-only.
filesrepresents the logical document,file_versionsrepresents immutable content snapshots, andcurrent_version_idpoints to the latest committed version. This supports rollback, audit history, collaborative edits, and safe reads during concurrent writes. -
Consistency model should be explicit. Metadata operations like create, rename, delete, and permission changes generally require strong consistency. Blob reads can be served from object storage or CDN with cache validation, but clients should resolve file identity and version through strongly consistent metadata.
-
Sync clients need change tracking, not recursive polling. Maintain a monotonically increasing
change_seqor per-namespace log of events: create, update, rename, delete, permission change. Clients callGET /changes?cursor=...; if the cursor is too old, force a full resync.
Worked example
For Design a production file storage service, start by clarifying scope: “Are we designing Dropbox-like user storage, enterprise document storage, or a backend service? Do we need folders, rename, permissions, file versioning, and per-directory limits? What scale should I assume for file count, object size, and QPS?” Then declare assumptions: file bytes are large and immutable, metadata must be strongly consistent, and object storage is durable but not transactional with the metadata database.
A strong answer can be organized around four pillars: API design, metadata schema, blob storage/upload flow, and concurrency/failure handling. For APIs, define CreateFolder, StartUpload, UploadPart, CommitUpload, Rename, Move, ListFolder, GetDownloadUrl, and Delete. For schema, separate nodes or files/folders, file_versions, blobs, upload_sessions, and idempotency_keys, with constraints like UNIQUE(parent_id, normalized_name).
The key tradeoff to flag is metadata consistency versus storage scalability: keep rename and directory limits in a transactional database, while storing bytes externally in `S3`-style storage and reconciling through background cleanup. When discussing per-directory limits, call out the race condition explicitly: two concurrent creates can both see 999 children and insert, so the limit must be enforced with a locked counter or conditional write. Close by saying that, with more time, you would add permission inheritance, audit logging, cross-region replication, and disaster recovery testing.
A second angle
For Design a Cloud File Storage Service, the same architecture applies, but the emphasis shifts toward client synchronization, collaboration, and user-visible behavior. You still separate immutable blobs from strongly consistent metadata, but now you should spend more time on GET /changes, conflict handling, offline edits, and version history. A rename might be represented as a metadata event rather than a delete-plus-create, so sync clients preserve identity and avoid re-downloading unchanged bytes. Resumable upload becomes more prominent because cloud clients may upload multi-GB files over unreliable networks. The design should state whether last-writer-wins is acceptable or whether conflicting versions are preserved as separate file versions.
Common pitfalls
Pitfall: Treating paths as the source of truth.
A tempting answer is to store files directly under keys like user_id/folder/doc.pdf and “rename” by copying objects. That fails for large files, concurrent renames, and version history. A better answer uses stable IDs for logical files and treats paths as mutable metadata.
Pitfall: Saying “use
S3andPostgres” without defining invariants.
The interviewer is not testing whether you know popular services; they are testing whether you can preserve correctness under retries, partial failures, and concurrent operations. State invariants such as “no duplicate child names under the same parent,” “a committed version always references an existing blob,” and “a folder’s child count never exceeds its configured limit.”
Pitfall: Over-indexing on scalability while ignoring atomicity.
Some candidates jump immediately to sharding, CDNs, and queues, then miss the hard correctness problem in rename, move, and create. Scalability matters, but for this system the sharp edge is transactional metadata. First design one correct shard or namespace, then explain how to partition by tenant_id, drive_id, or root folder once the invariants are clear.
Connections
Interviewers can pivot from this into distributed transactions, object storage internals, sync protocols, access control, or multi-region replication. Be ready to compare strong consistency versus eventual consistency, discuss cache invalidation for download URLs, and explain how audit logs or event streams support search indexing and client sync without becoming the source of truth.
Further reading
-
The Google File System — classic paper on distributed storage tradeoffs, chunking, replication, and metadata coordination.
-
Amazon DynamoDB paper — useful background on partitioning, availability, and eventual consistency, even if your metadata layer uses stronger transactions.
-
Stripe API Idempotency Keys — practical reference for designing retry-safe create and commit APIs.
Practice questions

What's being tested
This tests whether you can design a file identity mechanism that is correct, scalable, and practical under real I/O constraints. The interviewer is probing for how you reason about content equality, hashing, collision risk, streaming reads, and persistent indexes for deduplication. For a Software Engineer at Harvey, this matters because document-heavy systems often need to avoid duplicate storage, verify uploads, compare versions, and preserve correctness when user-visible legal or business documents are involved. A strong answer balances theoretical guarantees with engineering pragmatism: when to hash, when to compare bytes, how to handle huge files, and how to make the system work across machines.
Core knowledge
-
File identity should start with a crisp definition: identical by bytes, identical by normalized content, or identical ignoring filesystem metadata like filename,
mtime, permissions, owner, and path. Most system design answers should assume byte-for-byte equality unless explicitly asked to normalize. -
Metadata prefilters reduce expensive work. Compare file size first; unequal sizes imply unequal byte content. For local files, size plus inode/device can identify hard links. For distributed storage, size is a cheap index key but never a proof of equality.
-
Cryptographic hashes like
SHA-256orBLAKE3are standard for content identity because they are collision-resistant. A hash maps arbitrary bytes to fixed width:SHA-256gives 256 bits, so random collision probability is approximately for files. -
Non-cryptographic hashes like
xxHash,MurmurHash3, orCRC32Care faster but not collision-resistant. They can be useful as early filters or checksums, but for identity in adversarial or correctness-sensitive contexts, preferSHA-256,SHA-512/256, orBLAKE3. -
Collision handling is still part of a complete design. Even with
SHA-256, store(size, hash)and, when absolute certainty is required, perform a final byte-by-byte comparison for candidates sharing both size and hash. This is especially relevant for legal documents or adversarial uploads. -
Streaming I/O avoids loading large files into memory. Read chunks such as
4 MiBor8 MiB, update the hash incrementally, and keep memory per file. In most systems, hashing throughput is bounded by disk or network I/O before CPU, unless using very fast localNVMeor many parallel workers. -
Partial hashing can improve dedup search but is not sufficient for identity. For example, hash the first and last
64 KiBplus size to bucket likely duplicates, then compute the full hash only for candidate groups. This is a performance optimization, not a correctness proof. -
Persistent hash indexes support scalable deduplication. A common table schema is
(tenant_id, size_bytes, content_hash, storage_pointer, ref_count, created_at), with a composite index on(tenant_id, size_bytes, content_hash). Tenant scoping matters when dedup across customers has privacy or policy implications. -
Concurrency control is required when two identical files upload simultaneously. Use a database uniqueness constraint on
(tenant_id, size_bytes, content_hash)or an atomic compare-and-insert flow. Without this, two workers can compute the same hash and both store duplicate blobs. -
Chunk-level deduplication divides files into blocks and hashes each block. Fixed-size chunks are simple but shift poorly when bytes are inserted near the beginning. Content-defined chunking, such as
Rabin fingerprinting, handles insertions better but adds complexity and metadata overhead. -
Ignoring metadata means reading file contents, not relying on filesystem attributes. Filename, timestamps, extended attributes, permissions, and storage location should be excluded unless the product definition says they are part of identity. Be explicit about symlinks: compare link target bytes, link path, or resolved file contents.
-
Distributed environments add failure modes around retries and object storage consistency. Compute the hash while streaming upload, store it with the blob in
S3-like object storage, and make the database commit the source of truth. Treat upload retry idempotency separately from file content identity.
Worked example
For Determine if two files are identical, a strong candidate starts by clarifying: “Do we mean byte-for-byte identical, and can the files be larger than memory? Are these local files or remote objects? Is adversarial input possible?” Then they declare assumptions: byte equality, potentially large files, and correctness matters more than shaving a few milliseconds.
The skeleton answer has four pillars. First, compare cheap metadata: if size_bytes differs, return not identical. Second, stream both files in chunks and either compare chunks directly or compute hashes incrementally. Third, if using hashes, choose SHA-256 or BLAKE3, not MD5, and explain collision risk. Fourth, handle operational concerns: read errors, permissions, symlinks, and not loading entire files into memory.
A good design decision to flag is direct streaming comparison versus hashing. If you only need to compare two files once, direct chunk-by-chunk comparison can be better: it exits early on the first mismatch and avoids storing any digest. If you will compare the same files against many others later, computing and persisting a content hash pays off.
A polished close would be: “For a one-off local comparison, I’d do size check plus chunked byte comparison. For a deduplication system, I’d compute a strong streaming hash, index by (size, hash), and optionally byte-compare on collision candidates. If I had more time, I’d cover chunk-level deduplication and concurrent upload races.”
A second angle
For Determine identical files ignoring metadata, the same core concept applies, but the key move is separating content identity from file record identity. The candidate should explicitly exclude filename, path, permissions, mtime, owner, and storage-specific metadata from the equality decision. The design then becomes: read the content stream, compute a stable content hash, and compare (size, hash) rather than comparing filesystem stat structures.
The constraints can change if “metadata” includes embedded document metadata, such as PDF author fields or timestamps inside the file bytes. In that case, byte-for-byte hashing is no longer enough; you need a normalization layer that parses the document and emits canonical bytes before hashing. That is a different and riskier design because parsers, canonicalization rules, and format-specific edge cases become part of correctness.
Common pitfalls
Pitfall: Saying “just use
MD5and compare the hash” as a complete answer.
MD5 is fast but broken for collision resistance, and a hash alone is never a mathematical proof of equality. A better answer is size prefilter, streaming SHA-256 or BLAKE3, and final byte comparison if the domain requires absolute certainty.
Pitfall: Ignoring I/O and memory behavior.
A tempting but weak answer is “read both files into memory and compare strings.” That fails for multi-gigabyte files, remote object stores, and high-concurrency services. Strong candidates talk about chunked reads, backpressure, bounded memory, and whether the bottleneck is disk, network, or CPU.
Pitfall: Blurring product semantics with implementation.
When asked to ignore metadata, some candidates still include path, filename, or mtime in their equality check. Others overcorrect and claim two PDFs with different embedded timestamps are identical without defining normalization. The right move is to clarify which metadata is external filesystem metadata versus bytes inside the file format.
Connections
Interviewers may pivot from here into content-addressable storage, deduplication indexes, Merkle trees, object storage consistency, or idempotent uploads. They may also ask for a deeper algorithmic comparison between direct byte comparison, whole-file hashing, partial hashing, and chunk-level hashing.
Further reading
-
RFC 6234: US Secure Hash Algorithms — practical reference for
SHA-1,SHA-224,SHA-256,SHA-384, andSHA-512. -
BLAKE3 — modern high-throughput cryptographic hash with tree hashing and parallelism.
-
Venti: A New Approach to Archival Storage — classic content-addressable storage paper using hashes to identify immutable blocks.
Practice questions
Onsite
Coding & Algorithms

What's being tested
These problems test stateful data structure design for an in-memory spreadsheet: parsing formulas, resolving cell references, maintaining a dependency graph, and updating values incrementally. Interviewers are looking for clean APIs, correct invalidation order, and robust handling of cycles and stale cached values.
Patterns & templates
-
Cell model: store
rawinput, parsedexpr, cachedvalue,dependencies, anddependents; separates user state from computed state. -
Expression parsing: implement recursive descent or token scan for
+,-,*,/, parentheses, numbers, and cell refs; define precedence explicitly. -
Dependency graph: on
set(cell, formula), remove old edges, add newcell -> referencedCelledges, then update reverse dependents for propagation. -
Cycle detection: run DFS with
visiting/visitedstates before committing formula; rejectA1 = B1 + 1,B1 = A1 + 1. -
Incremental recomputation: after a leaf update, traverse reverse dependencies with topological DFS/BFS; recompute only affected cells in dependency-safe order.
-
Lazy evaluation alternative: mark dependents dirty and compute on
get(cell)via DFS memoization; simpler for sparse reads, worse for repeated hot reads. -
Complexities:
setisO(F + A)whereFis formula size andAaffected cells;getisO(1)eager orO(D)lazy.
Common pitfalls
Pitfall: Forgetting to delete old dependency edges means cells keep depending on references from previous formulas.
Pitfall: Detecting cycles only during
getcan leave the spreadsheet in an invalid committed state; validate before committing updates.
Pitfall: Recomputing every cell after each update is correct but usually fails the intended incremental-evaluation signal.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions

What's being tested
These problems test expression parsing plus stateful dependency management: parse formulas, resolve cell references, maintain a dependency graph, and recompute affected cells correctly. Interviewers are probing whether you can turn a spreadsheet-like API into clean data structures with cycle detection and incremental evaluation.
Patterns & templates
-
Recursive descent parsing for formulas like
=A1+B2*3; implementparseExpr,parseTerm,parseFactorto enforce precedence. -
Tokenization separates numbers, operators, parentheses, and cell refs; keep it
O(L)per formula length and reject malformed tokens early. -
Dependency graph maps
cell -> dependenciesand reversecell -> dependents; updates need both directions for efficient invalidation. -
DFS cycle detection with
visiting/visitedstates; run before committing a formula to catchA1 -> B1 -> A1. -
Incremental recomputation uses reverse graph traversal from changed cells; topologically evaluate dirty dependents in
O(V+E)over impacted subgraph. -
Memoized evaluation via
eval(cell)avoids repeated recomputation inside one update; invalidate cache entries reachable from changed cells. -
Stateful API design separates
setValue(cell, n),setFormula(cell, expr), andgetValue(cell); define behavior for missing cells, self-references, and errors.
Common pitfalls
Pitfall: Only evaluating formulas on
getValuewithout invalidation can return stale values after upstream cells change.
Pitfall: Detecting cycles only during evaluation often leaves the spreadsheet in a partially corrupted state; validate before committing updates.
Pitfall: Forgetting to remove old dependencies when replacing a formula causes phantom updates and incorrect graph edges.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions

What's being tested
These prompts test dependency graph maintenance for spreadsheet formulas: parsing expressions, tracking cell references, detecting cycles, and incrementally recomputing affected cells. A strong solution separates cell state, formula evaluation, and graph propagation instead of recomputing the whole sheet after every update.
Patterns & templates
-
Bidirectional graph: maintain
deps[cell] -> referenced_cellsandrevDeps[cell] -> dependents; update both whensetFormula(cell, expr)changes references. -
Expression parsing: tokenize formulas like
=A1+B2*3; use recursive descent or shunting-yard for precedence, parentheses, literals, and cell references. -
DFS cycle detection: before committing a formula, run
hasPath(ref, cell)or color-state DFS; reject formulas creating back edges. -
Topological recomputation: after
setValueorsetFormula, traverserevDepsfrom the changed cell and recompute in dependency order. -
Lazy vs eager evaluation: eager updates make
get(cell)O(1)but updates costO(affected); lazy evaluation shifts work toget. -
Transactional updates: parse and validate first, then mutate
deps,revDeps, and stored formula/value; rollback is simpler if validation happens before commit. -
Complexity target: update should be
O(size of affected subgraph + expression sizes), notO(total cells)unless constraints are tiny.
Common pitfalls
Pitfall: Only storing forward dependencies makes propagation hard; you need reverse edges to find cells invalidated by an update.
Pitfall: Detecting cycles only during evaluation can leave the spreadsheet in a corrupted state; validate before committing the new formula.
Pitfall: Forgetting to remove old formula edges causes stale dependents and incorrect recomputations after overwriting a formula.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions