What difficulty level is this interview question?

This is a medium-difficulty System Design question, commonly asked during onsite rounds at Harvey.

What role is this question designed for?

This question is commonly asked of Software Engineer candidates at Harvey during technical interviews.

Design a Secure PDF Data Room | Harvey Interview Question

This question evaluates a candidate's mastery of system design and secure authorization models for multi-tenant file-sharing platforms, focusing on organization-level access control, data modeling, consistency, scalable serving of large PDF assets, and auditability.

Design a virtual data room product that lets companies organize and securely share confidential PDF documents with other organizations.

The product should feel like a cloud drive: users create data rooms, organize PDFs into folders, upload and view documents, and invite other organizations to access selected content. For this interview, assume the system supports only PDF files at launch.

The central, non-negotiable focus is organization-level access control. A company must be able to grant another organization access to a data room, folder, or individual document with permissions such as view-only or admin. The system must enforce these permissions consistently across browsing, downloading, viewing, and search — there must be no path that leaks a document the caller is not authorized to see.

Constraints & Assumptions

Files: PDF only. Individual files can be large (high-page-count diligence documents), and a single room may hold many thousands of documents.
Tenancy: Multi-tenant. Every user belongs to one or more organizations. A room is owned by one organization and may be shared with many others.
Workloads: Read-heavy (browsing, viewing, search) far outweighs writes (upload, ACL changes). Assume viewing/browsing should feel interactive (sub-second metadata, fast first-page render).
Security & compliance: Documents are highly confidential (legal, financial, M&A diligence). Encryption at rest and in transit is mandatory. A tamper-resistant audit trail is required. Treat "no unauthorized access, ever" as a hard correctness requirement, not a best-effort SLO.
Availability: Prioritize availability for reads/viewing; uploads and ACL writes can tolerate slightly higher latency.

The Problem

Produce an end-to-end design covering: functional & non-functional requirements; the major services and storage choices; a data model for organizations, users, rooms, folders, documents, and ACLs; permission-evaluation rules including inheritance and overrides; APIs for creating rooms, uploading PDFs, inviting organizations, and checking access; how PDFs are served and protected for viewing/downloading; and auditing, logging, and monitoring. Make the authorization model the spine of the design.
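As a rough illustration of what "inheritance and overrides" could look like, the sketch below evaluates an organization's effective permission by walking from a resource up to its room. It assumes allow-only semantics (the effective permission is the strongest grant found on the chain) and that the owning organization is implicitly an admin; whether explicit deny rules are needed is left open by the question. The names `Permission`, `Resource`, and `effective_permission` are hypothetical, and in-memory dicts stand in for real storage.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, Optional, Tuple


class Permission(IntEnum):
    NONE = 0
    VIEW = 1
    ADMIN = 2


@dataclass(frozen=True)
class Resource:
    id: str
    kind: str                 # "room", "folder", or "document"
    parent_id: Optional[str]  # None for a room (top of the hierarchy)
    owner_org_id: str         # organization that owns the enclosing room


# ACL grants keyed by (resource_id, org_id); hypothetical in-memory stand-in.
Grants = Dict[Tuple[str, str], Permission]


def effective_permission(
    resources: Dict[str, Resource],
    grants: Grants,
    org_id: str,
    resource_id: str,
) -> Permission:
    """Strongest grant on the document -> folder chain -> room path."""
    node: Optional[Resource] = resources[resource_id]
    # Assumption: the owning organization always has admin rights.
    if node.owner_org_id == org_id:
        return Permission.ADMIN
    best = Permission.NONE
    while node is not None:
        # Allow-only union: keep the strongest grant seen so far.
        best = max(best, grants.get((node.id, org_id), Permission.NONE))
        node = resources[node.parent_id] if node.parent_id else None
    return best


# Example hierarchy: room1 contains folder f1 (holding d1) and document d2.
resources = {
    "room1": Resource("room1", "room", None, "acme"),
    "f1": Resource("f1", "folder", "room1", "acme"),
    "d1": Resource("d1", "document", "f1", "acme"),
    "d2": Resource("d2", "document", "room1", "acme"),
}
# The "bidder" organization is granted view-only access to folder f1.
grants: Grants = {("f1", "bidder"): Permission.VIEW}
```

With this data, the bidder organization inherits view access to d1 through f1 but has no access to d2, which sits directly under the room. A design with explicit deny rules or per-document overrides would need a different combining rule than `max`.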

Clarifying Questions to Ask

What is the granularity of sharing — can an external org be granted access to a single document, or only to a whole room/folder?
Do we need explicit deny rules, or is allow-only (union of grants) sufficient for v1?
Is view-only access expected to prevent download/printing/screenshotting, or only to gate the download endpoint?
What are the compliance/retention requirements for the audit log (immutability window, retention period, who can read it)?
Do we need full-text search inside PDF content, or only over document/folder names and metadata?
Are there data-residency or per-tenant key-isolation requirements?

What a Strong Answer Covers

A clear split of functional vs. non-functional requirements, with security treated as a correctness constraint.
A single, centralized authorization decision point that every read/write/serve/search path consults — and an explicit argument for why that prevents leaks.
A coherent data model: organizations, users, memberships, rooms, folders (hierarchy), documents (with object-storage keys + lifecycle status), and ACL grants keyed by resource + principal.
A precise, explainable permission-evaluation algorithm: inheritance from room → folder chain → document, the union/override semantics, and how the owning org's implicit rights are handled.
A secure upload pipeline (pre-signed direct upload, validation, async virus/PDF scan, status lifecycle) and a secure serving pipeline (short-lived signed URLs, optional page-rendering + watermarking).
ACL-aware search that cannot return unauthorized documents, with the index-vs-post-filter tradeoff named.
A tamper-resistant audit log: what events are captured, the event schema, append-only storage, and retention.
Caching/consistency reasoning that does not trade away authorization correctness (short TTLs, invalidation on ACL change, fast-expiring URLs).
Scalability choices (blobs in object storage, async workers, partitioned audit logs) and the key tradeoffs, stated as tradeoffs.
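One common way to make an append-only audit log tamper-evident is to chain entries by hash, so that editing or deleting an earlier event invalidates every later entry. The sketch below is a minimal in-memory illustration of that idea, not a prescribed design: the `AuditLog` class, the event fields, and the all-zero genesis value are assumptions, and a real system would also need durable storage and an externally anchored head hash.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict, List

GENESIS = "0" * 64  # assumed placeholder hash preceding the first entry


def _digest(prev_hash: str, event: Dict) -> str:
    # Canonical JSON so the same event always hashes to the same value.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


@dataclass
class AuditLog:
    entries: List[Dict] = field(default_factory=list)

    def append(self, event: Dict) -> str:
        """Append an event, linking it to the hash of the previous entry."""
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        entry_hash = _digest(prev, event)
        self.entries.append(
            {"event": event, "prev_hash": prev, "hash": entry_hash}
        )
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any edited or removed entry breaks it."""
        prev = GENESIS
        for entry in self.entries:
            if entry["prev_hash"] != prev:
                return False
            if entry["hash"] != _digest(prev, entry["event"]):
                return False
            prev = entry["hash"]
        return True


log = AuditLog()
log.append({"actor_org": "bidder", "action": "view", "doc": "d1", "ts": 1})
log.append({"actor_org": "bidder", "action": "download", "doc": "d1", "ts": 2})
```

Hash chaining only detects tampering; it does not prevent it. Retention and immutability guarantees would still come from the storage layer and from who is allowed to write to it.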

Follow-up Questions

A document is moved into a folder with more restrictive sharing. How do you ensure the effective permissions update atomically, and how do you handle in-flight signed URLs already issued under the old permissions?
An org's access to a room is revoked. What is the maximum window during which a previously-cached authorization decision or an outstanding signed URL could still grant access, and how do you bound it?
How would you extend the ACL model to support per-user (not just per-org) exceptions and time-bounded access (e.g., access that expires at a deal's close) without making permission evaluation unexplainable?
How would you support legally-defensible "who viewed what, when" reporting and detect anomalous bulk-download behavior?
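For the revocation-window question, it can help to make the bound concrete. The sketch below shows one possible shape for a short-lived, HMAC-signed access token and a worst-case window calculation. It assumes tokens are minted from possibly cached authorization decisions and that nothing is invalidated at revoke time; under those assumptions the worst case is the cache TTL plus the token TTL. The function names and token layout are hypothetical and do not correspond to any particular object store's signed-URL format.

```python
import hashlib
import hmac


def make_token(secret: bytes, doc_id: str, org_id: str, expires_at: int) -> str:
    """Sign (document, organization, expiry) so none of them can be altered."""
    msg = f"{doc_id}|{org_id}|{expires_at}".encode("utf-8")
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()


def check_token(
    secret: bytes,
    doc_id: str,
    org_id: str,
    expires_at: int,
    token: str,
    now: int,
) -> bool:
    """Reject expired tokens first, then compare signatures in constant time."""
    if now >= expires_at:
        return False
    expected = make_token(secret, doc_id, org_id, expires_at)
    return hmac.compare_digest(expected, token)


def max_revocation_window(authz_cache_ttl_s: int, token_ttl_s: int) -> int:
    """Worst case with no invalidation on revoke: a stale cached 'allow' is
    used to mint a token just before the cache entry expires, and that token
    then stays valid for its full lifetime."""
    return authz_cache_ttl_s + token_ttl_s


secret = b"demo-secret"  # placeholder; a real key would come from a KMS
token = make_token(secret, "d1", "bidder", expires_at=1000)
```

If ACL changes actively invalidate cached decisions, the window shrinks to roughly the token TTL, and re-checking authorization on every page fetch rather than per token narrows it further, at the cost of more load on the authorization path.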
