PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep
|Home/Coding & Algorithms/Pinterest

Design a violation log analyzer

Last updated: Apr 10, 2026

Quick Overview

This question evaluates proficiency in data structures, in-memory indexing and query design, algorithmic time/space complexity analysis, and reasoning about state management for dynamic event streams.

  • Medium
  • Pinterest
  • Coding & Algorithms
  • Software Engineer

Design a violation log analyzer

Company: Pinterest

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Technical Screen

You are given an append-only list of violation events as tuples (id: string, policy: string, date: ISO-8601 string). Build an in-memory "Violation Log Analyzer" that supports: 1) Given an id, return all policies that this id violated. 2) Given a policy, return all ids that violated this policy. 3) Given a date, return all ids that violated any policy on that exact date. Start with a straightforward approach, then: - Choose data structures and analyze the time/space of building the index and answering each query. - Optimize using appropriate inverted indexes (e.g., id→policies, policy→ids, date→ids). Show core APIs (build(log), query_by_id(id), query_by_policy(policy), query_by_date(date)) and pseudocode. - If there will be many repeated queries, propose caching or precomputation strategies (e.g., memoization of frequent queries, materialized sets), and discuss invalidation when new events arrive. - If the event list is sorted by date, explain how to use binary search to locate all events for a target date efficiently and return the corresponding ids; provide the algorithm and complexity. - Discuss trade-offs between query latency and memory, and how your design scales with number of events, distinct ids, and distinct policies.

Quick Answer: This question evaluates proficiency in data structures, in-memory indexing and query design, algorithmic time/space complexity analysis, and reasoning about state management for dynamic event streams.

Solution

### Problem restated (briefly) We ingest an append-only stream of violation events, each a tuple `(id, policy, date)` where `id` and `policy` are strings and `date` is an ISO-8601 day (e.g. `"2026-06-16"`). We must answer three lookups quickly: 1. `query_by_id(id)` → set of policies this id violated. 2. `query_by_policy(policy)` → set of ids that violated this policy. 3. `query_by_date(date)` → set of ids with any violation on that exact date. The classic interview arc here is: **start with the naive scan, then build inverted indexes, then add caching, then exploit sortedness with binary search, then discuss trade-offs and scaling.** I'll walk that arc. --- ### Step 0 — Naive baseline (the "straightforward approach") Keep the raw list `events: List[(id, policy, date)]` and answer every query with a full linear scan. ```python def query_by_id(events, target_id): return {policy for (eid, policy, date) in events if eid == target_id} def query_by_policy(events, target_policy): return {eid for (eid, policy, date) in events if policy == target_policy} def query_by_date(events, target_date): return {eid for (eid, policy, date) in events if date == target_date} ``` - **Build:** $O(1)$ (just store the list). - **Each query:** $O(N)$ time, where $N$ = number of events. - **Space:** $O(N)$ for the raw events. This is fine for a one-shot batch question but unacceptable if we answer many queries — every lookup re-reads the whole log. The fix is to pay the build cost once and amortize it across queries. --- ### Step 1 — Inverted indexes (the core optimization) Build three hash maps, each mapping a key to a **set** of values. We use sets (not lists) so repeated `(id, policy)` pairs or multiple violations on the same date collapse to distinct results automatically, matching the "all policies / all ids" semantics. ``` id_to_policies : id -> set(policy) policy_to_ids : policy -> set(id) date_to_ids : date -> set(id) ``` #### Core APIs ```python from collections import defaultdict class ViolationLogAnalyzer: def __init__(self): self.id_to_policies = defaultdict(set) self.policy_to_ids = defaultdict(set) self.date_to_ids = defaultdict(set) def add(self, eid, policy, date): # Single-event ingestion (append-only stream). self.id_to_policies[eid].add(policy) self.policy_to_ids[policy].add(eid) self.date_to_ids[date].add(eid) def build(self, log): # Bulk build from an existing list of events. for eid, policy, date in log: self.add(eid, policy, date) def query_by_id(self, eid): # Return a defensive copy so callers can't mutate the live index. return set(self.id_to_policies.get(eid, ())) def query_by_policy(self, policy): return set(self.policy_to_ids.get(policy, ())) def query_by_date(self, date): return set(self.date_to_ids.get(date, ())) ``` The three queries copy the internal set on the way out (`set(...)`), so a caller can never corrupt the index by mutating the result. The cost is $O(K)$ per query, which dominates the $O(1)$ hash lookup anyway. If profiling shows the copy is the bottleneck for a trusted, read-only caller, swap to returning the live set directly (and document that the result must not be mutated) or hand back a `frozenset`. #### Complexity Let $N$ = number of events, and let $K_x$ = size of the answer set for query $x$. | Operation | Time | Notes | |---|---|---| | `add` (one event) | $O(1)$ amortized | three hash-set inserts | | `build(log)` | $O(N)$ | one pass | | `query_by_id` | $O(1)$ lookup + $O(K)$ copy | hash lookup, then defensive copy | | `query_by_policy` | $O(1)$ + $O(K)$ | same | | `query_by_date` | $O(1)$ + $O(K)$ | same | **Subtlety to call out in the interview:** the hash lookup is $O(1)$, but returning the *internal* set by reference exposes mutable state — an accidental `result.add(...)` by the caller would silently corrupt the index. The code above defaults to a defensive copy (the $O(K)$ term), which is the safe choice. If the caller is trusted and read-only, you can return the live set directly (or a `frozenset` view) to drop the copy and make the query $O(1)$ — a deliberate latency-vs-safety trade-off worth naming aloud. - **Space:** $O(N)$ overall. More precisely, `id_to_policies` and `policy_to_ids` together store at most one entry per distinct `(id, policy)` pair, and `date_to_ids` stores at most one entry per distinct `(date, id)` pair — both bounded by $N$ (deduped), so it's $O(N)$, not $O(N^2)$. Worst case it's roughly $2$–$3\times$ the raw log because each event contributes to (up to) three maps. This is the right baseline answer: $O(N)$ build, $O(1)$ amortized incremental insert, near-$O(1)$ queries. --- ### Step 2 — Many repeated queries: caching & precomputation The inverted index already makes a single query $O(1)$ to locate the answer set, so the win from caching is mostly about **avoiding the $O(K)$ copy/serialization cost** for hot keys, or memoizing **derived/composite results** that are expensive to recompute (e.g. set intersections like "ids that violated policy P *on* date D"). Strategies, cheapest first: - **Return cached references / materialized sets.** The maps *are* materialized sets, so the *lookup* in `query_by_id` is already $O(1)$; the only residual cost is the $O(K)$ defensive copy (or whatever post-processing the caller needs). Cache the post-processed form (e.g. a sorted list, a JSON blob, or an immutable `frozenset` view) when the caller repeatedly needs that exact shape, so you skip re-copying a hot key on every call. - **Memoize composite queries.** If hot queries are intersections/unions (policy ∧ date), keep an explicit cache keyed by the query tuple. A plain dict you control is cleaner here than `functools.lru_cache`, because writes must be able to invalidate individual entries (and `lru_cache` on an instance method would also key on `self`, which leaks across instances): ```python def __init__(self): ... self.composite_cache = {} # (policy, date) -> frozenset(ids) def ids_for_policy_and_date(self, policy, date): key = (policy, date) if key not in self.composite_cache: ids = self.policy_to_ids.get(policy, set()) & self.date_to_ids.get(date, set()) self.composite_cache[key] = frozenset(ids) return self.composite_cache[key] ``` On `add(eid, policy, date)`, drop any cached entry whose policy or date just changed (see the invalidation note below) — e.g. `self.composite_cache.pop((policy, date), None)`. - **Precompute frequency-driven materialized views.** Track query counts; for the top-$M$ hottest keys, precompute and store the finished result (sorted list, count, etc.). This trades memory for tail-latency stability. - **Cache negative results too** (empty answers) so repeated misses on unknown ids/policies stay $O(1)$ instead of re-deriving emptiness. #### Invalidation when new events arrive This is the crux the interviewer is probing. The log is **append-only**, which makes invalidation tractable: - **Targeted invalidation (preferred).** An incoming event `(id, policy, date)` can only *grow* exactly three answer sets: `id_to_policies[id]`, `policy_to_ids[policy]`, `date_to_ids[date]`. So on `add`, invalidate only cache entries keyed by that `id`, that `policy`, and that `date` (plus any composite caches touching them). Everything else stays warm. This is cheap and surgical. - **Versioning / generation counters.** Keep a global monotonically increasing `version`. Stamp each cache entry with the version at compute time; on read, treat an entry as stale if a finer-grained per-key version has advanced. This avoids walking a large cache on every write. - **Append-friendly property.** Because we never delete events, cached sets are *monotone* — they only ever get supersets. So for additive caches you can update in place (`cache[id].add(policy)`) instead of evicting, which is even cheaper than invalidation. Full eviction is only needed for *derived* values (counts, sorted lists, intersections) where adding one element can't be patched in O(1) for every shape. - **Write-heavy caveat:** if writes vastly outnumber reads, skip caching entirely — the index lookups are already $O(1)$ and cache maintenance becomes pure overhead. --- ### Step 3 — If events are sorted by date: binary search Suppose instead of (or in addition to) the hash index we keep events in an array **sorted by date**. Then `query_by_date(D)` becomes a contiguous-range problem: all events for `D` form one contiguous block, and we can find its boundaries with binary search — no full index needed for the date query, which saves the `date_to_ids` map's memory. Use the standard lower-bound / upper-bound pair (`bisect_left` / `bisect_right`): ```python import bisect # events sorted ascending by date; keys = [e.date for e in events] def query_by_date_sorted(events, keys, target_date): lo = bisect.bisect_left(keys, target_date) # first index with date >= target hi = bisect.bisect_right(keys, target_date) # first index with date > target return {events[i].id for i in range(lo, hi)} # the [lo, hi) block is exactly target_date ``` - `bisect_left` returns the first position where `target_date` could be inserted keeping order → the start of the equal-date block. - `bisect_right` returns the position just past the last equal element → the end (exclusive). - The half-open range `[lo, hi)` is **exactly** the set of events on `target_date`. #### Complexity - **Locating the block:** $O(\log N)$ for the two binary searches. - **Collecting ids:** $O(K)$ where $K = hi - lo$ is the number of events on that date. - **Total:** $O(\log N + K)$, which is optimal — you must touch all $K$ results to return them. - **Space:** $O(N)$ for the sorted array; **no `date_to_ids` map needed**, so this is strictly leaner if dates dominate memory. **Practical notes:** - ISO-8601 date strings sort lexicographically the same as chronologically (`"2026-01-31" < "2026-02-01"` as strings), so plain string comparison works as the sort key — no date parsing required. (This holds only for zero-padded, same-precision ISO dates; mixed formats break it.) - Append-only + sorted-by-date is a happy case: if events arrive in date order (which a real-time log roughly does), appends keep the array sorted with no re-sort. If they can arrive out of order, you'd need an insertion (`O(N)` shift) or a different structure (e.g. a balanced BST / B-tree keyed by date), at which point the hash `date_to_ids` map is usually simpler and gives $O(1)$ date lookups. - Binary search only helps `query_by_date` (the sorted key). `query_by_id` and `query_by_policy` still want their inverted hash maps; sorting by date does nothing for them. --- ### Step 4 — Trade-offs: latency vs. memory, and scaling Let $N$ = events, $I$ = distinct ids, $P$ = distinct policies, $D$ = distinct dates. | Design | Build | per-query | Extra memory | When to use | |---|---|---|---|---| | Naive scan | $O(1)$ | $O(N)$ | $O(N)$ | tiny logs / one-shot | | Hash inverted indexes | $O(N)$ | $O(1)$+$O(K)$ | $\sim 2$–$3\times N$ | many mixed queries (default) | | Sorted array + binary search (dates) | $O(N \log N)$ to sort | $O(\log N + K)$ date; others slow | $O(N)$ (no date map) | date queries dominate, memory-tight | | Hybrid (hash for id/policy, sorted for date) | $O(N \log N)$ | $O(1)$ id/policy, $O(\log N + K)$ date | between the two | balanced, memory-conscious | Key trade-offs: - **Latency vs. memory.** The inverted indexes buy $O(1)$ queries by spending memory on three copies of the key relationships. Dropping `date_to_ids` in favor of a sorted array trades a bit of date-query latency ($O(1) \to O(\log N + K)$) for one fewer map — meaningful when $D$ is large or memory is the binding constraint. - **Read-heavy vs. write-heavy.** Read-heavy → maximize precomputation/caching. Write-heavy (the append stream is busy) → keep the per-`add` cost minimal (the three $O(1)$ inserts) and avoid caches that need constant invalidation. - **Result-set size $K$.** No design beats $O(K)$ to *return* a large answer; the index only removes the $O(N)$ *search*. If callers only need a **count** or **existence**, store cached cardinalities (`len`) and skip materializing the set — that's where precomputation pays off most. Scaling behavior: - **Number of events $N$:** memory grows linearly; query time stays flat with the hash index. The list/array form degrades linearly per query, which is why we index. - **Distinct ids $I$ / policies $P$:** these bound the *number of keys* and the *fan-out* of each set. A policy violated by everyone gives a huge `policy_to_ids[policy]` set — $O(K)$ to return is unavoidable; consider pagination/streaming for such hot keys. - **Beyond one machine:** if $N$ outgrows memory, partition (shard) by a key — e.g. shard `id_to_policies`/`policy_to_ids` by hash of the key, and shard `date_to_ids` by date range (which also keeps each shard naturally sorted for binary search). Old dates can be tiered to disk/columnar storage while recent dates stay hot in memory. At that scale this becomes a search-index / inverted-index system (the same shape as Lucene/Elasticsearch postings lists), and the in-memory version above is exactly the single-node core of it. --- ### Edge cases & correctness checklist - **Unknown key:** return an empty set, not an error (`.get(key, set())`). - **Duplicate events** `(id, policy, date)` repeated: sets dedupe automatically; counts (if needed) require a separate multiset/counter. - **Same id, same policy, different dates:** `id_to_policies[id]` still holds `policy` once (correct — "policies violated" is a set), while `date_to_ids` records the id under each date. - **Empty log:** all queries return empty sets; build is a no-op. - **Mutation safety:** never hand out the live internal set unless documented read-only; prefer returning a copy/`frozenset`. - **Time-range queries** (a likely follow-up): the sorted-by-date array generalizes for free — `bisect_left(start)` to `bisect_right(end)` gives an inclusive `[start, end]` range in $O(\log N + K)$, which the hash `date_to_ids` map cannot do without iterating every date in range.

Related Interview Questions

  • Maximize Boxes Stored Through One Entrance - Pinterest (medium)
  • Solve Multiple Coding Interview Problems - Pinterest (medium)
  • Implement a Sparse Matrix Class - Pinterest (medium)
  • Assign Pins to Shortest Columns - Pinterest (medium)
  • Design Hierarchical Permission Checks - Pinterest (medium)
Pinterest logo
Pinterest
Sep 6, 2025, 12:00 AM
Software Engineer
Technical Screen
Coding & Algorithms
37
0

You are given an append-only list of violation events as tuples (id: string, policy: string, date: ISO-8601 string). Build an in-memory "Violation Log Analyzer" that supports:

  1. Given an id, return all policies that this id violated.
  2. Given a policy, return all ids that violated this policy.
  3. Given a date, return all ids that violated any policy on that exact date. Start with a straightforward approach, then:
  • Choose data structures and analyze the time/space of building the index and answering each query.
  • Optimize using appropriate inverted indexes (e.g., id→policies, policy→ids, date→ids). Show core APIs (build(log), query_by_id(id), query_by_policy(policy), query_by_date(date)) and pseudocode.
  • If there will be many repeated queries, propose caching or precomputation strategies (e.g., memoization of frequent queries, materialized sets), and discuss invalidation when new events arrive.
  • If the event list is sorted by date, explain how to use binary search to locate all events for a target date efficiently and return the corresponding ids; provide the algorithm and complexity.
  • Discuss trade-offs between query latency and memory, and how your design scales with number of events, distinct ids, and distinct policies.

Solution

Show

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More Coding & Algorithms•More Pinterest•More Software Engineer•Pinterest Software Engineer•Pinterest Coding & Algorithms•Software Engineer Coding & Algorithms
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.