What difficulty level is this interview question?

This is a medium difficulty System Design question, commonly asked during Technical Screen rounds at Snowflake.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Snowflake during technical interviews.

Design an Audit Logs Service | Snowflake Interview Question

Q: Design an Audit Logs Service

This question evaluates the ability to design a scalable, multi-tenant audit logging system, covering event storage, tamper-evident retention, and efficient tenant-scoped querying. It is commonly asked in system design interviews to test how well a candidate balances high-throughput ingestion with low-latency reads under strict data durability and compliance requirements.

Design a backend service that records audit log events for a multi-tenant SaaS platform and lets each account's users query the audit history for their own account.

An audit event captures "who did what, to which resource, when, and from where" — for example, a user signed in, changed a permission, deleted a record, rotated an API key, or exported data. These events are written by many internal services across the platform and are read back by end users (and their admins) through a UI that supports filtering and pagination. Audit logs are also relied on for security investigations and compliance, so they must be durable, tamper-evident, and queryable for a long retention window.

The interviewer is specifically interested in three things: how you store the event/transaction log, how the system scales as event volume and tenant count grow, and how you keep query latency low for the user-facing read path.

Constraints & Assumptions

Multi-tenant: events always belong to exactly one account_id (tenant). Cross-tenant reads are never allowed.
Write volume: assume a high, bursty ingest rate (e.g., on the order of 50k–100k events/sec at peak across all tenants), heavily skewed toward a few large tenants.
Read pattern: user-facing queries are scoped to a single account, usually filtered by time range and optionally by actor, action type, or resource; results are paginated (e.g., 50 per page) and typically hit recent data.
Retention: events must be retained and queryable for a long window (e.g., 1–2 years), with much older data allowed to move to colder, slower storage.
Latency target: p99 for an interactive query page is on the order of a few hundred milliseconds; ingestion can be asynchronous and is allowed to be eventually consistent (a few seconds of write-to-readable lag is acceptable).
Durability/integrity: an accepted event must not be silently lost, and the audit trail should be append-only and resistant to tampering.
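A quick back-of-the-envelope sizing helps frame the storage discussion. The sketch below is only illustrative: the 100k events/sec peak and the 2-year window come from the constraints above, while the 1,000-byte average event size and the 25% average-to-peak ratio are assumptions chosen for the example, not figures from the question.

```python
def estimate_storage(peak_events_per_sec: float,
                     avg_to_peak_ratio: float,
                     avg_event_bytes: int,
                     retention_days: int) -> dict:
    """Rough raw-storage estimate, before replication, indexes, or compression."""
    seconds_per_day = 86_400
    # Sustained rate is assumed to be a fraction of the stated peak.
    events_per_day = peak_events_per_sec * avg_to_peak_ratio * seconds_per_day
    bytes_per_day = events_per_day * avg_event_bytes
    total_bytes = bytes_per_day * retention_days
    return {
        "events_per_day": events_per_day,
        "tb_per_day": bytes_per_day / 1e12,
        "pb_retained": total_bytes / 1e15,
    }


# Assumed inputs: 100k/sec peak, 25% sustained, 1,000 bytes/event, 730 days.
estimate = estimate_storage(100_000, 0.25, 1_000, 730)
print(estimate)
```

Under these assumptions the raw log grows by roughly 2 TB per day and reaches about 1.6 PB over two years, which is why tiering older data to colder storage matters.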

Clarifying Questions to Ask

What is the acceptable write-to-read lag — must a user see their own action immediately, or is a few seconds of delay fine?
Are events immutable and append-only, or can they be edited/deleted (e.g., for legal "right to be forgotten" requests)?
What query dimensions must be supported (time range only, or also actor / action / resource / IP), and is free-text search required?
What are the retention and compliance requirements (duration, immutability guarantees, export, who can read)?
What is the tenant size distribution — do we need to handle a few "whale" tenants that dwarf everyone else?
What read QPS and per-query result-size limits should we plan for, and is the query strictly per-account or are there cross-account admin/security views?

Part 1 — Ingestion and storage of the event log

Define the audit event schema and design the write path: how producing services submit events, how you guarantee accepted events are durably stored without adding latency or coupling to the producers, and what the primary storage model is for the log.
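One possible starting point for the schema and write path is sketched below. The field names, the `AuditEvent` and `DurableLog` classes, and the use of a producer-generated `event_id` for deduplication are all assumptions made for illustration; the in-memory log stands in for whatever replicated, append-only log a real design would use.

```python
import json
import uuid
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    """Hypothetical schema: who did what, to which resource, when, from where."""
    account_id: str   # tenant that owns the event
    actor_id: str     # who
    action: str       # did what
    resource: str     # to which resource
    source_ip: str    # from where
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())
    # Generated by the producer so that retries carry the same id.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class DurableLog:
    """In-memory stand-in for a replicated, append-only log keyed by tenant."""

    def __init__(self) -> None:
        self._partitions: dict[str, list[str]] = {}
        self._seen_ids: set[str] = set()

    def append(self, event: AuditEvent) -> bool:
        """Append once per event_id; return False when a retry is a duplicate."""
        if event.event_id in self._seen_ids:
            return False
        record = json.dumps(asdict(event))
        self._partitions.setdefault(event.account_id, []).append(record)
        self._seen_ids.add(event.event_id)
        return True

    def read(self, account_id: str) -> list[dict]:
        """Return only the requested tenant's events, in append order."""
        return [json.loads(r) for r in self._partitions.get(account_id, [])]


log = DurableLog()
event = AuditEvent("acct-1", "user-42", "api_key.rotate", "api_key/7", "203.0.113.5")
first = log.append(event)   # accepted
retry = log.append(event)   # same event_id, ignored
```

Producers would acknowledge the caller once the log accepts the event; indexing for the read path can then happen asynchronously, which fits the few seconds of write-to-readable lag the constraints allow.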

Part 2 — Low-latency, per-account query path

Design the read path that powers the user-facing UI: a user queries the audit log for their own account, filtered by time range and optionally by actor/action/resource, paginated. Explain how you keep p99 latency low and how you enforce tenant isolation.

Clarifying Questions for this Part

Which secondary filters (actor/action/resource) are common enough to deserve dedicated indexes versus being applied as post-filters on a time-scan?
Is strict read-after-write needed for the requesting user's own just-performed action, or is a short delay acceptable here too?
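A minimal sketch of a tenant-scoped, keyset-paginated read is shown below, using SQLite purely as a stand-in for the serving store. The table layout, column names, and the `query_page` helper are assumptions for illustration. The key ideas are that `account_id` is always the leading predicate and index column, and that the cursor is the `(occurred_at, event_id)` of the last row returned rather than an offset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The primary key doubles as the (account_id, occurred_at, event_id) index.
conn.execute("""
    CREATE TABLE audit_events (
        account_id  TEXT NOT NULL,
        occurred_at TEXT NOT NULL,  -- ISO-8601 UTC, so string order is time order
        event_id    TEXT NOT NULL,
        actor_id    TEXT NOT NULL,
        action      TEXT NOT NULL,
        PRIMARY KEY (account_id, occurred_at, event_id)
    )
""")


def query_page(conn, account_id, start, end, page_size=50, cursor=None, actor_id=None):
    """Return (rows, next_cursor), newest first, for exactly one tenant."""
    sql = ["SELECT occurred_at, event_id, actor_id, action FROM audit_events",
           "WHERE account_id = ? AND occurred_at >= ? AND occurred_at < ?"]
    params = [account_id, start, end]
    if actor_id is not None:
        # Applied as a post-filter on the time scan in this sketch.
        sql.append("AND actor_id = ?")
        params.append(actor_id)
    if cursor is not None:
        last_ts, last_id = cursor
        # Keyset condition: strictly older than the last row already returned.
        sql.append("AND (occurred_at < ? OR (occurred_at = ? AND event_id < ?))")
        params += [last_ts, last_ts, last_id]
    sql.append("ORDER BY occurred_at DESC, event_id DESC LIMIT ?")
    params.append(page_size)
    rows = conn.execute(" ".join(sql), params).fetchall()
    next_cursor = (rows[-1][0], rows[-1][1]) if len(rows) == page_size else None
    return rows, next_cursor


seed = [("acct-1", f"2024-01-01T00:00:0{i}Z", f"e{i}", "user-1", "login") for i in range(5)]
seed.append(("acct-2", "2024-01-01T00:00:01Z", "x1", "user-9", "login"))
conn.executemany("INSERT INTO audit_events VALUES (?, ?, ?, ?, ?)", seed)

window = ("2024-01-01T00:00:00Z", "2024-01-02T00:00:00Z")
page1, cur1 = query_page(conn, "acct-1", *window, page_size=2)
page2, cur2 = query_page(conn, "acct-1", *window, page_size=2, cursor=cur1)
```

Because each page seeks directly to the cursor position, the cost of fetching page N does not grow with N the way offset-based pagination does. In a real service the `account_id` would come from the authenticated session, never from a client-supplied parameter.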

Part 3 — Scaling, retention, and reliability

Explain how the system scales with event volume and tenant count, how you handle skew from very large tenants, how you manage long retention with tiering, and how you keep the audit trail durable and tamper-evident.
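One common way to make an append-only log tamper-evident is a hash chain, where each entry commits to the hash of the entry before it. The sketch below is a minimal illustration of that idea, not a complete integrity design; the entry layout and the function names are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def chain_hash(prev_hash: str, event: dict) -> str:
    """SHA-256 over the previous hash plus a canonical encoding of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(chain: list[dict], event: dict) -> None:
    """Append an entry that links to the current tail of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"event": event,
                  "prev_hash": prev_hash,
                  "hash": chain_hash(prev_hash, event)})


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every link; any edited, removed, or reordered entry fails."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != chain_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


chain: list[dict] = []
for action in ("user.login", "permission.change", "record.delete"):
    append_event(chain, {"account_id": "acct-1", "actor_id": "user-42", "action": action})

intact = verify_chain(chain)
chain[1]["event"]["action"] = "permission.view"  # simulate tampering
tampered_ok = verify_chain(chain)
```

A chain on its own only detects edits if the latest hash is also recorded somewhere an attacker cannot rewrite, for example by periodically publishing it to write-once storage. A per-tenant chain also serializes that tenant's appends, which is worth weighing against the ingest throughput targets.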

Follow-up Questions

A single "whale" tenant generates 10x the volume of all others combined and its partition is now a hotspot. How do you rebalance without downtime or breaking that tenant's time-ordered queries?
A compliance requirement says audit events must be provably unmodified since write. How do you make the log tamper-evident, and how does that interact with your storage and ingest design?
Product now wants free-text search and aggregation (e.g., "all failed logins by IP this week") over the audit data. How does that change your storage/indexing, and would you introduce a separate system for it?
A downstream consumer was offline for 6 hours during a burst. How does your ingest design let it catch up without losing events or overwhelming storage?
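For the first follow-up, one option is to spread a very large tenant across several sub-partitions ("buckets") and merge the buckets at read time so the tenant still sees a single time-ordered stream. The sketch below assumes a hypothetical per-tenant `bucket_counts` map and the helper names shown; it illustrates only the write-side routing and the read-side k-way merge, not the migration itself.

```python
import hashlib
import heapq
import itertools


def bucket_for(account_id: str, event_id: str, bucket_counts: dict[str, int]) -> int:
    """Pick a sub-partition; tenants without an entry stay in a single bucket."""
    n = bucket_counts.get(account_id, 1)
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def merged_newest_first(buckets: list[list[tuple[str, str]]], limit: int) -> list[tuple[str, str]]:
    """K-way merge of buckets that are each already sorted newest-first."""
    merged = heapq.merge(*buckets, reverse=True)
    return list(itertools.islice(merged, limit))


bucket_counts = {"whale": 4}  # assumed: only the hot tenant is split
events = [(f"2024-01-01T00:00:{i:02d}Z", f"e{i}") for i in range(20)]

buckets: list[list[tuple[str, str]]] = [[] for _ in range(4)]
for ts, event_id in events:
    buckets[bucket_for("whale", event_id, bucket_counts)].append((ts, event_id))
for bucket in buckets:
    bucket.sort(reverse=True)  # each bucket keeps its own newest-first order

latest = merged_newest_first(buckets, limit=5)
```

Reads fan out to at most `n` buckets and stop after `limit` rows, so a page of 50 stays cheap. Changing a tenant's bucket count without downtime would additionally require recording which count applied to which time range, so that older data can still be located.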
