PracHub

Design an Audit Logs Service

Last updated: Jul 1, 2026

Quick Overview

This question evaluates the ability to design a scalable, multi-tenant audit logging system, covering event storage, tamper-evident retention, and efficient tenant-scoped querying. It is commonly asked in system design interviews to test how well a candidate balances high-throughput ingestion with low-latency reads under strict data durability and compliance requirements.


Company: Snowflake

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Technical Screen

# Design an Audit Logs Service

Design a backend service that records **audit log events** for a multi-tenant SaaS platform and lets each account's users **query the audit history for their own account**.

An audit event captures "who did what, to which resource, when, and from where": for example, a user signed in, changed a permission, deleted a record, rotated an API key, or exported data. These events are written by many internal services across the platform and are read back by end users (and their admins) through a UI that supports filtering and pagination. Audit logs are also relied on for security investigations and compliance, so they must be durable, tamper-evident, and queryable for a long retention window.

The interviewer is specifically interested in three things: **how you store the event/transaction log**, **how the system scales** as event volume and tenant count grow, and **how you keep query latency low** for the user-facing read path.

### Constraints & Assumptions

- Multi-tenant: events always belong to exactly one `account_id` (tenant). Cross-tenant reads are never allowed.
- Write volume: assume a high, bursty ingest rate (e.g., on the order of 50k–100k events/sec at peak across all tenants), heavily skewed toward a few large tenants.
- Read pattern: user-facing queries are scoped to a single account, usually filtered by time range and optionally by actor, action type, or resource; results are paginated (e.g., 50 per page) and typically hit recent data.
- Retention: events must be retained and queryable for a long window (e.g., 1–2 years), with much older data allowed to move to colder, slower storage.
- Latency target: p99 for an interactive query page is on the order of a few hundred milliseconds; ingestion can be asynchronous and is allowed to be eventually consistent (a few seconds of write-to-readable lag is acceptable).
- Durability/integrity: an accepted event must not be silently lost, and the audit trail should be append-only and resistant to tampering.

### Clarifying Questions to Ask

- What is the acceptable write-to-read lag: must a user see their own action immediately, or is a few seconds of delay fine?
- Are events immutable and append-only, or can they be edited/deleted (e.g., for legal "right to be forgotten" requests)?
- What query dimensions must be supported (time range only, or also actor / action / resource / IP), and is free-text search required?
- What are the retention and compliance requirements (duration, immutability guarantees, export, who can read)?
- What is the tenant size distribution: do we need to handle a few "whale" tenants that dwarf everyone else?
- What read QPS and per-query result-size limits should we plan for, and is the query strictly per-account or are there cross-account admin/security views?

### Part 1: Ingestion and storage of the event log

Define the audit event schema and design the **write path**: how producing services submit events, how you guarantee accepted events are durably stored without adding latency or coupling to the producers, and what the primary storage model is for the log.

```hint Where to start
Treat the producers and the durable store as decoupled. A durable, partitioned append-only log (e.g., a Kafka-style topic) in front of storage absorbs bursts, decouples producers from the database, and gives you replay for free.
```

```hint Schema and partitioning
Model an event as an immutable row: `event_id`, `account_id`, `actor_id`, `action`, `resource_type`, `resource_id`, `timestamp`, `ip`, `metadata` (JSON). Think hard about the partition/sort key: `account_id` plus time is what nearly every read filters on.
```

### Part 2: Low-latency, per-account query path

Design the **read path** that powers the user-facing UI: a user queries the audit log for their own account, filtered by time range and optionally by actor/action/resource, paginated. Explain how you keep p99 latency low and how you enforce tenant isolation.

```hint Index for the access pattern
Make `account_id` the partition key and `timestamp` the clustering/sort order so a single account's recent events are contiguous and a time-range scan is a sequential read, not a full-table filter.
```

```hint Pagination and hot data
Use keyset (cursor) pagination on `(timestamp, event_id)` rather than `OFFSET`, which degrades on deep pages. Keep recent data on the fastest tier and consider a cache for the common "last 24h" view.
```

#### Clarifying Questions for this Part

- Which secondary filters (actor/action/resource) are common enough to deserve dedicated indexes versus being applied as post-filters on a time-scan?
- Is strict read-after-write needed for the requesting user's own just-performed action, or is a short delay acceptable here too?

### Part 3: Scaling, retention, and reliability

Explain how the system scales with event volume and tenant count, how you handle skew from very large tenants, how you manage long retention with tiering, and how you keep the audit trail durable and tamper-evident.

```hint Scale and skew
Sharding/partitioning by `account_id` distributes load, but a few "whale" tenants create hot partitions. Discuss sub-partitioning a hot tenant by time bucket or hashing, and time-based partitions you can drop/roll cheaply.
```

```hint Retention and tiering
Time-bucketed partitions let you age data: hot recent partitions on fast storage, older partitions rolled to cheap object storage / a columnar archive that's still queryable but slower. Retention enforcement becomes "drop the old partition."
```

### Follow-up Questions

- A single "whale" tenant generates 10x the volume of all others combined and its partition is now a hotspot. How do you rebalance without downtime or breaking that tenant's time-ordered queries?
- A compliance requirement says audit events must be provably unmodified since write. How do you make the log tamper-evident, and how does that interact with your storage and ingest design?
- Product now wants free-text search and aggregation (e.g., "all failed logins by IP this week") over the audit data. How does that change your storage/indexing, and would you introduce a separate system for it?
- A downstream consumer was offline for 6 hours during a burst. How does your ingest design let it catch up without losing events or overwhelming storage?
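### Illustrative sketch: event schema and log partitioning (Part 1)

The schema hint in Part 1 can be made concrete with a small Python sketch. The field names come from the hint; everything else is an assumption made for illustration: the `log_partition` and `serialize` helpers, the choice of SHA-256 for a stable hash, the partition count of 64, and the sample values. A real design would use whatever partitioner the chosen log system provides.

```python
import hashlib
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    """Immutable audit record; field names follow the Part 1 schema hint."""
    account_id: str
    actor_id: str
    action: str
    resource_type: str
    resource_id: str
    timestamp: datetime
    ip: str
    metadata: dict = field(default_factory=dict)
    # Producer-generated ID (assumption) so retried submissions can be de-duplicated.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def log_partition(account_id: str, num_partitions: int) -> int:
    """Hypothetical helper: stable hash so one account maps to one log partition."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def serialize(event: AuditEvent) -> bytes:
    """Hypothetical helper: encode the event as JSON bytes for the append-only log."""
    record = asdict(event)
    record["timestamp"] = event.timestamp.isoformat()
    return json.dumps(record, sort_keys=True).encode("utf-8")


event = AuditEvent(
    account_id="acct-42",
    actor_id="user-7",
    action="api_key.rotate",
    resource_type="api_key",
    resource_id="key-1",
    timestamp=datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc),
    ip="203.0.113.5",
)
partition = log_partition(event.account_id, num_partitions=64)
payload = serialize(event)
```

Keying the log by `account_id` keeps one tenant's events ordered within a partition, at the cost of hot partitions for very large tenants, which Part 3 asks about.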
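### Illustrative sketch: keyset pagination scoped to one account (Part 2)

The Part 2 hint recommends keyset pagination on `(timestamp, event_id)`. The sketch below shows one way that could look, using Python's built-in `sqlite3` purely as a stand-in for the real store. The table name `audit_events`, the column `ts` (epoch milliseconds), and the `fetch_page` function are assumptions for illustration, not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE audit_events (
        account_id TEXT NOT NULL,
        ts         INTEGER NOT NULL,
        event_id   TEXT NOT NULL,
        action     TEXT NOT NULL,
        PRIMARY KEY (account_id, ts, event_id)
    )
    """
)
conn.executemany(
    "INSERT INTO audit_events VALUES (?, ?, ?, ?)",
    [
        ("acct-a", 1000, "e1", "user.login"),
        ("acct-a", 2000, "e2", "permission.change"),
        ("acct-a", 2000, "e3", "record.delete"),  # same ts as e2: event_id breaks the tie
        ("acct-a", 3000, "e4", "api_key.rotate"),
        ("acct-a", 4000, "e5", "data.export"),
        ("acct-b", 2500, "x1", "user.login"),     # other tenant, must never be returned
    ],
)


def fetch_page(conn, account_id, page_size, cursor=None):
    """Return (rows, next_cursor), newest first.

    cursor is the (ts, event_id) of the last row on the previous page.
    Every query is filtered by account_id, which enforces tenant scoping.
    """
    sql = "SELECT ts, event_id, action FROM audit_events WHERE account_id = ? "
    params = [account_id]
    if cursor is not None:
        last_ts, last_id = cursor
        # Seek strictly past the last row seen instead of using OFFSET.
        sql += "AND (ts < ? OR (ts = ? AND event_id < ?)) "
        params += [last_ts, last_ts, last_id]
    sql += "ORDER BY ts DESC, event_id DESC LIMIT ?"
    params.append(page_size)
    rows = conn.execute(sql, params).fetchall()
    next_cursor = (rows[-1][0], rows[-1][1]) if len(rows) == page_size else None
    return rows, next_cursor


seen, cursor = [], None
while True:
    rows, cursor = fetch_page(conn, "acct-a", page_size=2, cursor=cursor)
    seen.extend(row[1] for row in rows)
    if cursor is None:
        break
```

Because each page seeks directly to a position in the `(account_id, ts, event_id)` key, the cost of a page does not grow with how deep the user has scrolled, which is the degradation the hint attributes to `OFFSET`.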
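### Illustrative sketch: sub-partitioning a hot tenant (Part 3)

The "Scale and skew" hint suggests sub-partitioning a whale tenant by hashing. A minimal in-memory sketch of that idea follows. The bucket configuration `HOT_TENANT_BUCKETS`, the `account_id#bucket` key format, and the `write` / `read_recent` functions are all hypothetical names chosen for this example. Reads fan out over the tenant's buckets and merge the results so the caller still sees one time-ordered stream.

```python
import hashlib
import heapq
from collections import defaultdict
from itertools import islice

# Assumed configuration: tenants listed here are spread over N buckets.
HOT_TENANT_BUCKETS = {"acct-whale": 4}

# partition key -> list of (ts, event_id); stands in for the real store.
store = defaultdict(list)


def partition_key(account_id: str, event_id: str) -> str:
    """Small tenants use a single bucket; hot tenants are hashed across several."""
    buckets = HOT_TENANT_BUCKETS.get(account_id, 1)
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % buckets
    return f"{account_id}#{bucket}"


def write(account_id: str, ts: int, event_id: str) -> None:
    store[partition_key(account_id, event_id)].append((ts, event_id))


def read_recent(account_id: str, limit: int) -> list:
    """Fan out to every bucket of the tenant, then merge newest-first."""
    buckets = HOT_TENANT_BUCKETS.get(account_id, 1)
    streams = [
        sorted(store.get(f"{account_id}#{b}", []), reverse=True)
        for b in range(buckets)
    ]
    merged = heapq.merge(*streams, reverse=True)
    return list(islice(merged, limit))


for i in range(1, 11):
    write("acct-whale", ts=i, event_id=f"w{i}")
write("acct-small", ts=5, event_id="s1")

latest = read_recent("acct-whale", limit=3)
```

The trade-off is that a hot tenant's read touches several partitions instead of one. Changing the bucket count for a live tenant would need extra care (for example, applying a new count only to events after a cutover time), which this sketch does not attempt.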
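### Illustrative sketch: tiering and retention by time bucket (Part 3)

The "Retention and tiering" hint describes aging data by time-bucketed partition. The sketch below classifies daily partitions into hot, cold, or expired. The thresholds are assumptions: `HOT_DAYS = 30` is arbitrary, and `RETENTION_DAYS = 730` takes the upper end of the 1–2 year window in the constraints. The function names are hypothetical.

```python
from datetime import date, timedelta

HOT_DAYS = 30          # assumed cutoff for the fast tier
RETENTION_DAYS = 730   # assumed: roughly 2 years, upper end of the stated window


def tier_for(partition_day: date, today: date) -> str:
    """Classify one daily partition by its age."""
    age_days = (today - partition_day).days
    if age_days >= RETENTION_DAYS:
        return "expired"   # eligible to be dropped as a whole partition
    if age_days >= HOT_DAYS:
        return "cold"      # candidate for object storage / columnar archive
    return "hot"


def plan_partitions(partition_days, today: date) -> dict:
    """Group partitions by the tier they should live in."""
    plan = {"hot": [], "cold": [], "expired": []}
    for day in sorted(partition_days):
        plan[tier_for(day, today)].append(day.isoformat())
    return plan


today = date(2026, 7, 1)
partitions = [today - timedelta(days=n) for n in (0, 10, 45, 400, 800)]
result = plan_partitions(partitions, today)
```

With partitions aligned to time, enforcing retention is a metadata operation on whole partitions rather than a row-by-row delete, which matches the hint's "drop the old partition" framing.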
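### Illustrative sketch: a hash chain for tamper evidence (follow-up)

One common way to approach the tamper-evidence follow-up is a hash chain, where each entry's hash covers the previous entry's hash. The question does not prescribe this technique, so treat the sketch as one possible answer under that assumption. `GENESIS`, `chain_hash`, `append`, and `verify` are hypothetical names, and the in-memory list stands in for the real log.

```python
import copy
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def chain_hash(prev_hash: str, event: dict) -> str:
    """Hash the previous entry's hash together with a canonical event encoding."""
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(log: list, event: dict) -> None:
    """Append an event, linking it to the current head of the chain."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev_hash": prev, "hash": chain_hash(prev, event)})


def verify(log: list) -> bool:
    """Recompute every link; any edited, removed, or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev:
            return False
        if entry["hash"] != chain_hash(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True


log = []
for action in ("user.login", "permission.change", "record.delete"):
    append(log, {"account_id": "acct-42", "action": action})

intact = verify(log)

tampered = copy.deepcopy(log)
tampered[1]["event"]["action"] = "user.login"  # simulate an after-the-fact edit
tamper_detected = not verify(tampered)
```

A chain on its own only detects edits by someone who cannot also recompute every later hash. Periodically publishing the head hash to a separate, write-once location is a typical complement, and per-account or per-partition chains fit more naturally with a partitioned ingest path than a single global chain.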


Related Interview Questions

  • Design an Automated Jira-Ticket-to-PR System - Snowflake (hard)
  • Design a Cron Job Scheduler - Snowflake (medium)
  • Design a REST API Abstraction Layer - Snowflake (hard)
  • Design a disk-backed KV store under contention - Snowflake (easy)
  • Design an ACL authorization checking service - Snowflake (hard)
Jun 17, 2026, 12:00 AM
