Design a Secure PDF Data Room
Company: Harvey
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Onsite
Design a **virtual data room** product that lets companies organize and securely share confidential PDF documents with other organizations.
The product should feel like a cloud drive: users create *data rooms*, organize PDFs into folders, upload and view documents, and invite other organizations to access selected content. For this interview, assume the system supports **only PDF files** at launch.
The central, non-negotiable focus is **organization-level access control**. A company must be able to grant another organization access to a data room, folder, or individual document with permissions such as *view-only* or *admin*. The system must enforce these permissions **consistently** across browsing, downloading, viewing, and search — there must be no path that leaks a document the caller is not authorized to see.
### Constraints & Assumptions
- **Files:** PDF only. Individual files can be large (high-page-count diligence documents), and a single room may hold many thousands of documents.
- **Tenancy:** Multi-tenant. Every user belongs to one or more organizations. A room is owned by one organization and may be shared with many others.
- **Workloads:** Read-heavy. Reads (browsing, viewing, search) far outweigh writes (upload, ACL changes). Viewing and browsing should feel interactive: sub-second metadata responses and a fast first-page render.
- **Security & compliance:** Documents are highly confidential (legal, financial, M&A diligence). Encryption at rest and in transit is mandatory. A tamper-resistant audit trail is required. Treat "no unauthorized access, ever" as a hard correctness requirement, not a best-effort SLO.
- **Availability:** Prioritize availability for reads/viewing; uploads and ACL writes can tolerate slightly higher latency.
### The Problem
Produce an end-to-end design covering: functional & non-functional requirements; the major services and storage choices; a data model for organizations, users, rooms, folders, documents, and ACLs; permission-evaluation rules including inheritance and overrides; APIs for creating rooms, uploading PDFs, inviting organizations, and checking access; how PDFs are served and protected for viewing/downloading; and auditing, logging, and monitoring. Make the **authorization model** the spine of the design.
```hint Where to start
Separate the two hard sub-problems: (1) a metadata/ACL system that needs strong consistency, and (2) bulk PDF storage and serving that needs scale and short-lived secure access. A common split is object storage for blobs and a relational DB for metadata and ACLs.
```
```hint Where do access decisions live?
List every path that can surface a document — browse, view, download, search, any API. If each one re-implements its own check, what's the chance they stay perfectly in sync as the system grows? Think about what structural choice would make a leak hard *by construction* rather than by careful review.
```
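One way to make the "single decision point" idea concrete is to route every handler through the same guard. The sketch below is illustrative only: `authorize`, `guarded`, `Forbidden`, `POLICY`, and the two handlers are hypothetical names, and the in-memory policy dict stands in for whatever authoritative ACL store the design uses.

```python
from functools import wraps


class Forbidden(Exception):
    """Raised when the central check denies a request."""


# Assumption: a toy policy table of (org id, resource id) -> allowed actions.
POLICY = {("bidder", "d1"): {"view"}}


def authorize(org_id, resource_id, action, policy=POLICY):
    """The only place where an allow/deny decision is made."""
    return action in policy.get((org_id, resource_id), set())


def guarded(action):
    """Wrap a handler so it cannot run without passing `authorize`."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(org_id, resource_id, *args, **kwargs):
            if not authorize(org_id, resource_id, action):
                raise Forbidden(f"{org_id} may not {action} {resource_id}")
            return handler(org_id, resource_id, *args, **kwargs)
        return wrapper
    return decorator


@guarded("view")
def view_document(org_id, resource_id):
    return f"bytes of {resource_id}"


@guarded("download")
def download_document(org_id, resource_id):
    return f"attachment for {resource_id}"
```

With this shape, a new endpoint that forgets the decorator is an obvious review finding, and a change to the rules happens in one function rather than in every path.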
```hint Inheritance & evaluation
ACLs are grants attached to resources (`room` / `folder` / `document`) for principals (chiefly `organization`). How would you compute a document's effective permissions from its room grant, ancestor-folder grants, and any direct grant? Decide early whether v1 even needs explicit `deny` — and if you allow it, work out which way a conflict between an allow and a deny must resolve, and what that costs in explainability.
```
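As a minimal sketch of one possible evaluation rule, assume an allow-only model where the effective permission is the strongest grant found on the document, its ancestor folders, or its room, and where the owning organization implicitly has admin rights. The names `Level`, `Resource`, and `effective_level`, and the dict-based stores, are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Level(IntEnum):
    NONE = 0
    VIEW = 1
    ADMIN = 2


@dataclass(frozen=True)
class Resource:
    id: str
    kind: str                 # "room", "folder", or "document"
    parent_id: Optional[str]  # None for a room
    owner_org: str


def effective_level(org_id, resource_id, resources, grants):
    """Walk from the resource up to its room, keeping the strongest grant.

    resources: dict of resource id -> Resource
    grants: dict of (resource id, org id) -> Level
    """
    node = resources[resource_id]
    if node.owner_org == org_id:
        return Level.ADMIN  # owning org has implicit full rights
    best = Level.NONE
    while node is not None:
        best = max(best, grants.get((node.id, org_id), Level.NONE))
        node = resources.get(node.parent_id) if node.parent_id else None
    return best


# Example: "bidder" is granted view access on folder f1 only.
resources = {
    "room1": Resource("room1", "room", None, "acme"),
    "f1": Resource("f1", "folder", "room1", "acme"),
    "d1": Resource("d1", "document", "f1", "acme"),
    "d2": Resource("d2", "document", "room1", "acme"),
}
grants = {("f1", "bidder"): Level.VIEW}
```

Under this union rule a grant lower in the tree can only add access, never remove it, which keeps every decision explainable as "the strongest grant on this path". Supporting folders that are more restrictive than their room would require either explicit deny entries or a flag that stops inheritance at a folder, and both make the explanation longer.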
```hint How do bytes actually reach the viewer?
If the app tier streamed every PDF itself it wouldn't scale; if it handed out a durable storage URL, who could re-share it? Find a middle path that offloads the bytes yet stays gated by your authorization check. Then ask what "view-only" should mean for a document the user can still screenshot — and how you'd make a leaked copy traceable back to who leaked it.
```
```hint What does search do with permissions?
A full-text index that ignores ACLs will happily return documents the caller can't see. How does authorization interact with the index — do you filter before, during, or after the query, and what does each choice cost you in latency and in staleness when an ACL changes mid-flight? Whatever you pick, ask whether it can ever *show* a document that was just revoked.
```
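A possible hybrid, sketched here with a toy in-memory index, filters during the query using ACL data stored alongside each indexed document and then re-checks every hit against the authoritative store before returning it. `IndexedDoc`, `search`, and the `is_authorized` callback are hypothetical names; a real system would push the first filter into the search engine's query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDoc:
    doc_id: str
    text: str
    ancestors: frozenset  # the document's own id plus every folder/room above it


def search(query, org_id, index, granted_ids, is_authorized):
    """Return ids of documents matching `query` that `org_id` may see.

    granted_ids: resource ids the org holds a grant on (may be slightly stale).
    is_authorized: authoritative check, called per hit before returning.
    """
    hits = []
    for doc in index:
        if query.lower() not in doc.text.lower():
            continue
        # Filter during the query: skip documents with no granted ancestor.
        if not (doc.ancestors & granted_ids):
            continue
        # Post-filter: re-check so a just-revoked grant cannot surface a hit.
        if is_authorized(org_id, doc.doc_id):
            hits.append(doc.doc_id)
    return hits
```

The in-query filter keeps result pages full and cheap, while the post-filter bounds staleness: a stale index or grant list can at worst hide a newly shared document, not show a revoked one.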
### Clarifying Questions to Ask
- What is the granularity of sharing — can an external org be granted access to a single document, or only to a whole room/folder?
- Do we need explicit *deny* rules, or is allow-only (union of grants) sufficient for v1?
- Is view-only access expected to prevent download/printing/screenshotting, or only to gate the download endpoint?
- What are the compliance/retention requirements for the audit log (immutability window, retention period, who can read it)?
- Do we need full-text search inside PDF content, or only over document/folder names and metadata?
- Are there data-residency or per-tenant key-isolation requirements?
### What a Strong Answer Covers
- A clear split of functional vs. non-functional requirements, with security treated as a correctness constraint.
- A **single, centralized authorization decision point** that every read/write/serve/search path consults — and an explicit argument for why that prevents leaks.
- A coherent data model: organizations, users, memberships, rooms, folders (hierarchy), documents (with object-storage keys + lifecycle status), and ACL grants keyed by resource + principal.
- A precise, *explainable* permission-evaluation algorithm: inheritance from room → folder chain → document, the union/override semantics, and how the owning org's implicit rights are handled.
- A secure upload pipeline (pre-signed direct upload, validation, async virus/PDF scan, status lifecycle) and a secure serving pipeline (short-lived signed URLs, optional page-rendering + watermarking).
- ACL-aware search that cannot return unauthorized documents, with the index-vs-post-filter tradeoff named.
- A tamper-resistant audit log: what events are captured, the event schema, append-only storage, and retention.
- Caching/consistency reasoning that does not trade away authorization correctness (short TTLs, invalidation on ACL change, fast-expiring URLs).
- Scalability choices (blobs in object storage, async workers, partitioned audit logs) and the key tradeoffs, stated as tradeoffs.
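
To illustrate the tamper-resistant audit log item above, here is a minimal sketch of one common technique: a hash chain in which each entry commits to the previous entry's hash. The event fields and function names are assumptions; a production design would also anchor the chain head somewhere the application cannot rewrite, such as write-once storage.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, event):
    """Hash the previous entry's hash together with a canonical event encoding."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()


def append_event(log, event):
    """Append an event, linking it to the entry before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})


def verify_chain(log):
    """Recompute every link; any edited, removed, or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

The chain makes tampering detectable rather than impossible, which is why it is usually paired with append-only storage and restricted write credentials.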
### Follow-up Questions
- A document is moved into a folder with *more restrictive* sharing. How do you ensure the effective permissions update atomically, and how do you handle in-flight signed URLs already issued under the old permissions?
- An org's access to a room is revoked. What is the maximum window during which a previously-cached authorization decision or an outstanding signed URL could still grant access, and how do you bound it?
- How would you extend the ACL model to support per-user (not just per-org) exceptions and time-bounded access (e.g., access that expires at a deal's close) without making permission evaluation unexplainable?
- How would you support legally-defensible "who viewed what, when" reporting and detect anomalous bulk-download behavior?
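
For the revocation-window question, a rough bound can be worked out from the two TTLs. The sketch below assumes no active invalidation (no cache purge on ACL change, no URL revocation list); the function and parameter names are hypothetical.

```python
def max_exposure_seconds(decision_cache_ttl, signed_url_ttl,
                         urls_use_cached_decisions=True):
    """Upper bound on how long access can outlive a revocation."""
    if urls_use_cached_decisions:
        # A stale cached "allow" can mint a URL just before the cache entry
        # expires, and that URL then lives for its full TTL.
        return decision_cache_ttl + signed_url_ttl
    # URL issuance always consults the authoritative store, so the two
    # stale paths run in parallel rather than back to back.
    return max(decision_cache_ttl, signed_url_ttl)
```

With a 30-second decision cache and 60-second URLs, the bound is 90 seconds if URL issuance trusts the cache and 60 seconds if it does not. Invalidating the cache on ACL change shrinks the first term, and shorter URL TTLs shrink the second.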
Quick Answer: This question evaluates a candidate's mastery of system design and secure authorization models for multi-tenant file-sharing platforms, focusing on organization-level access control, data modeling, consistency, scalable serving of large PDF assets, and auditability.