Design an Audit Logs Service
Company: Snowflake
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Technical Screen
# Design an Audit Logs Service
Design a backend service that records **audit log events** for a multi-tenant SaaS platform and lets each account's users **query the audit history for their own account**.
An audit event captures "who did what, to which resource, when, and from where" — for example, a user signed in, changed a permission, deleted a record, rotated an API key, or exported data. These events are written by many internal services across the platform and are read back by end users (and their admins) through a UI that supports filtering and pagination. Audit logs are also relied on for security investigations and compliance, so they must be durable, tamper-evident, and queryable for a long retention window.
The interviewer is specifically interested in three things: **how you store the event/transaction log**, **how the system scales** as event volume and tenant count grow, and **how you keep query latency low** for the user-facing read path.
### Constraints & Assumptions
- Multi-tenant: every event belongs to exactly one `account_id` (tenant). Cross-tenant reads are never allowed.
- Write volume: assume a high, bursty ingest rate (e.g., on the order of 50k–100k events/sec at peak across all tenants), heavily skewed toward a few large tenants.
- Read pattern: user-facing queries are scoped to a single account, usually filtered by time range and optionally by actor, action type, or resource; results are paginated (e.g., 50 per page) and typically hit recent data.
- Retention: events must be retained and queryable for a long window (e.g., 1–2 years), with much older data allowed to move to colder, slower storage.
- Latency target: p99 for an interactive query page is on the order of a few hundred milliseconds; ingestion can be asynchronous and is allowed to be eventually consistent (a few seconds of write-to-readable lag is acceptable).
- Durability/integrity: an accepted event must not be silently lost, and the audit trail should be append-only and resistant to tampering.
### Clarifying Questions to Ask
- What is the acceptable write-to-read lag — must a user see their own action immediately, or is a few seconds of delay fine?
- Are events immutable and append-only, or can they be edited/deleted (e.g., for legal "right to be forgotten" requests)?
- What query dimensions must be supported (time range only, or also actor / action / resource / IP), and is free-text search required?
- What are the retention and compliance requirements (duration, immutability guarantees, export, who can read)?
- What is the tenant size distribution — do we need to handle a few "whale" tenants that dwarf everyone else?
- What read QPS and per-query result-size limits should we plan for, and is the query strictly per-account or are there cross-account admin/security views?
### Part 1 — Ingestion and storage of the event log
Define the audit event schema and design the **write path**: how producing services submit events, how you guarantee accepted events are durably stored without adding latency or coupling to the producers, and what the primary storage model is for the log.
```hint Where to start
Treat the producers and the durable store as decoupled. A durable, partitioned append-only log (e.g., a Kafka-style topic) in front of storage absorbs bursts, decouples producers from the database, and lets consumers replay events for as long as the log retains them.
```
```hint Schema and partitioning
Model an event as an immutable row: `event_id`, `account_id`, `actor_id`, `action`, `resource_type`, `resource_id`, `timestamp`, `ip`, `metadata` (JSON). Think hard about the partition/sort key — `account_id` plus time is what nearly every read filters on.
```
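The schema hint above can be sketched as an immutable record. This is a minimal illustration, not a prescribed design: the `AuditEvent` class, the `storage_key` and `to_json` helpers, and the choice of a random UUID for `event_id` are all assumptions made for the example.

```python
from __future__ import annotations

import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class AuditEvent:
    account_id: str
    actor_id: str
    action: str
    resource_type: str
    resource_id: str
    ip: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    # Assumption: a random UUID is enough to deduplicate producer retries.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def storage_key(self) -> Tuple[str, datetime, str]:
        # Partition by account, sort by time; event_id breaks timestamp ties.
        return (self.account_id, self.timestamp, self.event_id)

    def to_json(self) -> str:
        # Serialized form a producer might publish to the ingest log.
        record = asdict(self)
        record["timestamp"] = self.timestamp.isoformat()
        return json.dumps(record, sort_keys=True)
```

Putting `account_id` first and `timestamp` second in the key mirrors the filter that nearly every read applies, so one account's events for a time range end up adjacent in storage.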
### Part 2 — Low-latency, per-account query path
Design the **read path** that powers the user-facing UI: a user queries the audit log for their own account, filtered by time range and optionally by actor/action/resource, paginated. Explain how you keep p99 latency low and how you enforce tenant isolation.
```hint Index for the access pattern
Make `account_id` the partition key and `timestamp` the clustering/sort order so a single account's recent events are contiguous and a time-range scan is a sequential read, not a full-table filter.
```
```hint Pagination and hot data
Use keyset (cursor) pagination on `(timestamp, event_id)` rather than `OFFSET`, which degrades on deep pages because the database still has to read and discard every skipped row. Keep recent data on the fastest tier and consider a cache for the common "last 24h" view.
```
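The keyset pagination hint above can be sketched with an in-memory SQLite table. The table name `audit_events`, the `ts` column (epoch microseconds), and the `open_demo_db` and `fetch_page` helpers are assumptions for illustration; a production store would differ, but the cursor predicate has the same shape.

```python
import sqlite3
from typing import List, Optional, Tuple

PAGE_SIZE = 50
Cursor = Tuple[int, str]  # (ts, event_id) of the last row already returned


def open_demo_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE audit_events (
               account_id TEXT    NOT NULL,
               ts         INTEGER NOT NULL,  -- epoch microseconds
               event_id   TEXT    NOT NULL,
               action     TEXT    NOT NULL,
               PRIMARY KEY (account_id, ts, event_id)
           )"""
    )
    return conn


def fetch_page(
    conn: sqlite3.Connection,
    account_id: str,
    cursor: Optional[Cursor] = None,
    limit: int = PAGE_SIZE,
) -> Tuple[List[tuple], Optional[Cursor]]:
    # The account_id predicate is always present: it is the tenant boundary.
    sql = "SELECT ts, event_id, action FROM audit_events WHERE account_id = ?"
    params: list = [account_id]
    if cursor is not None:
        # Resume strictly after the last row seen; event_id breaks ts ties.
        sql += " AND (ts < ? OR (ts = ? AND event_id < ?))"
        params += [cursor[0], cursor[0], cursor[1]]
    sql += " ORDER BY ts DESC, event_id DESC LIMIT ?"
    params.append(limit + 1)  # fetch one extra row to detect a next page
    rows = conn.execute(sql, params).fetchall()
    page = rows[:limit]
    next_cursor = (page[-1][0], page[-1][1]) if len(rows) > limit else None
    return page, next_cursor
```

Because the cursor encodes a position in the sort order rather than a row count, each page is an index seek followed by a short scan, regardless of how deep the user has paged.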
#### Clarifying Questions for this Part
- Which secondary filters (actor/action/resource) are common enough to deserve dedicated indexes versus being applied as post-filters on a time-scan?
- Is strict read-after-write needed for the requesting user's own just-performed action, or is a short delay acceptable here too?
### Part 3 — Scaling, retention, and reliability
Explain how the system scales with event volume and tenant count, how you handle skew from very large tenants, how you manage long retention with tiering, and how you keep the audit trail durable and tamper-evident.
```hint Scale and skew
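Sharding/partitioning by `account_id` distributes load, but a few "whale" tenants create hot partitions. Discuss sub-partitioning a hot tenant by time bucket or by hash, and note that hashing spreads writes but forces reads to merge results across sub-partitions. Time-based partitions also give you units you can drop or roll to colder storage cheaply.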
```
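One way to picture the sub-partitioning hint above is a composite partition key of account, day bucket, and shard. This is a sketch under stated assumptions: the `SHARDS_BY_ACCOUNT` override table, the daily bucket size, and all function names are hypothetical.

```python
import hashlib
import heapq
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical override table: only known-large tenants get more than one shard.
SHARDS_BY_ACCOUNT: Dict[str, int] = {"acct_whale": 8}
DEFAULT_SHARDS = 1

PartitionKey = Tuple[str, str, int]  # (account_id, UTC day, shard)


def shard_count(account_id: str) -> int:
    return SHARDS_BY_ACCOUNT.get(account_id, DEFAULT_SHARDS)


def partition_key(account_id: str, ts: datetime, event_id: str) -> PartitionKey:
    day = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    # Stable hash of event_id so retries of the same event hit the same shard.
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % shard_count(account_id)
    return (account_id, day, shard)


def partitions_for_range(
    account_id: str, start: datetime, end: datetime
) -> List[PartitionKey]:
    # A read must visit every shard of every day bucket the range touches.
    keys: List[PartitionKey] = []
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    while day <= last:
        for shard in range(shard_count(account_id)):
            keys.append((account_id, day.isoformat(), shard))
        day += timedelta(days=1)
    return keys


def merge_newest_first(per_partition: Iterable[Iterable[tuple]]) -> Iterator[tuple]:
    # Each input is already sorted newest-first; merge keeps global time order.
    return heapq.merge(*per_partition, reverse=True)
```

The trade-off is visible in `partitions_for_range`: a small tenant reads one partition per day, while a sharded tenant fans out to several and merges. Changing a tenant's shard count for buckets that already hold data would need extra care, for example applying the new count only from the next time bucket onward.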
```hint Retention and tiering
Time-bucketed partitions let you age data: hot recent partitions on fast storage, older partitions rolled to cheap object storage / a columnar archive that's still queryable but slower. Retention enforcement becomes "drop the old partition."
```
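The tiering hint above amounts to classifying each time-bucketed partition by age. The sketch below assumes daily partitions, a 30-day hot window (an arbitrary illustrative value), and a 730-day retention limit taken from the upper end of the stated 1–2 year window; the names `tier_for` and `plan_lifecycle` are hypothetical.

```python
from datetime import date
from typing import Dict, Iterable, List

HOT_DAYS = 30         # assumption: how long a partition stays on fast storage
RETENTION_DAYS = 730  # roughly 2 years, the upper end of the stated window


def tier_for(partition_day: date, today: date) -> str:
    age_days = (today - partition_day).days
    if age_days > RETENTION_DAYS:
        return "expired"  # past retention: the whole partition can be dropped
    if age_days > HOT_DAYS:
        return "cold"     # still queryable, but from cheaper, slower storage
    return "hot"


def plan_lifecycle(partition_days: Iterable[date], today: date) -> Dict[str, List[date]]:
    # Group partitions by the action a background lifecycle job would take.
    plan: Dict[str, List[date]] = {"hot": [], "cold": [], "expired": []}
    for day in sorted(partition_days):
        plan[tier_for(day, today)].append(day)
    return plan
```

Because the decision is made per partition rather than per row, expiring data is a metadata operation instead of a large delete that competes with ingest.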
### Follow-up Questions
- A single "whale" tenant generates 10x the volume of all others combined and its partition is now a hotspot. How do you rebalance without downtime or breaking that tenant's time-ordered queries?
- A compliance requirement says audit events must be provably unmodified since write. How do you make the log tamper-evident, and how does that interact with your storage and ingest design?
- Product now wants free-text search and aggregation (e.g., "all failed logins by IP this week") over the audit data. How does that change your storage/indexing, and would you introduce a separate system for it?
- A downstream consumer was offline for 6 hours during a burst. How does your ingest design let it catch up without losing events or overwhelming storage?
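For the tamper-evidence follow-up above, one commonly discussed technique is a hash chain, where each record's hash covers the previous record's hash. The sketch below is one possible approach, not the expected answer; the record layout and the `append_event` and `first_invalid_index` helpers are assumptions.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS_HASH = "0" * 64  # fixed starting value for an empty chain


def chain_hash(prev_hash: str, event: Dict[str, Any]) -> str:
    # Canonical JSON so the same event always serializes to the same bytes.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log: List[Dict[str, Any]], event: Dict[str, Any]) -> None:
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    log.append(
        {"event": event, "prev_hash": prev_hash, "hash": chain_hash(prev_hash, event)}
    )


def first_invalid_index(log: List[Dict[str, Any]]) -> int:
    # Returns the index of the first record that fails verification, or -1.
    prev_hash = GENESIS_HASH
    for i, record in enumerate(log):
        expected = chain_hash(prev_hash, record["event"])
        if record["prev_hash"] != prev_hash or record["hash"] != expected:
            return i
        prev_hash = record["hash"]
    return -1
```

A chain on its own only detects edits by someone who cannot rewrite every later hash. A fuller answer would likely also anchor the latest hash somewhere the storage operator cannot modify, such as a periodic checkpoint in a separate write-once store, and would maintain one chain per partition so that ingest is not serialized through a single global chain.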
Quick Answer: This question evaluates the ability to design a scalable, multi-tenant audit logging system, covering event storage, tamper-evident retention, and efficient tenant-scoped querying. It is commonly asked in system design interviews to test how well a candidate balances high-throughput ingestion with low-latency reads under strict data durability and compliance requirements.