You are given an append-only list of violation events as tuples (id: string, policy: string, date: ISO-8601 string). Build an in-memory "Violation Log Analyzer" that supports:
1) Given an id, return all policies that this id violated.
2) Given a policy, return all ids that violated this policy.
3) Given a date, return all ids that violated any policy on that exact date.
Start with a straightforward approach, then:
- Choose data structures and analyze the time/space of building the index and answering each query.
- Optimize using appropriate inverted indexes (e.g., id→policies, policy→ids, date→ids). Show core APIs (build(log), query_by_id(id), query_by_policy(policy), query_by_date(date)) and pseudocode.
- If there will be many repeated queries, propose caching or precomputation strategies (e.g., memoization of frequent queries, materialized sets), and discuss invalidation when new events arrive.
- If the event list is sorted by date, explain how to use binary search to locate all events for a target date efficiently and return the corresponding ids; provide the algorithm and complexity.
- Discuss trade-offs between query latency and memory, and how your design scales with number of events, distinct ids, and distinct policies.
Quick Answer: This question evaluates proficiency in data structures, in-memory indexing and query design, algorithmic time/space complexity analysis, and reasoning about state management for dynamic event streams.
Solution
### Problem restated (briefly)
We ingest an append-only stream of violation events, each a tuple `(id, policy, date)` where `id` and `policy` are strings and `date` is an ISO-8601 day (e.g. `"2026-06-16"`). We must answer three lookups quickly:
1. `query_by_id(id)` → set of policies this id violated.
2. `query_by_policy(policy)` → set of ids that violated this policy.
3. `query_by_date(date)` → set of ids with any violation on that exact date.
The classic interview arc here is: **start with the naive scan, then build inverted indexes, then add caching, then exploit sortedness with binary search, then discuss trade-offs and scaling.** I'll walk that arc.
---
### Step 0 — Naive baseline (the "straightforward approach")
Keep the raw list `events: List[(id, policy, date)]` and answer every query with a full linear scan.
```python
def query_by_id(events, target_id):
return {policy for (eid, policy, date) in events if eid == target_id}
def query_by_policy(events, target_policy):
return {eid for (eid, policy, date) in events if policy == target_policy}
def query_by_date(events, target_date):
return {eid for (eid, policy, date) in events if date == target_date}
```
- **Build:** $O(1)$ (just store the list).
- **Each query:** $O(N)$ time, where $N$ = number of events.
- **Space:** $O(N)$ for the raw events.
This is fine for a one-shot batch question but unacceptable if we answer many queries — every lookup re-reads the whole log. The fix is to pay the build cost once and amortize it across queries.
---
### Step 1 — Inverted indexes (the core optimization)
Build three hash maps, each mapping a key to a **set** of values. We use sets (not lists) so repeated `(id, policy)` pairs or multiple violations on the same date collapse to distinct results automatically, matching the "all policies / all ids" semantics.
```
id_to_policies : id -> set(policy)
policy_to_ids : policy -> set(id)
date_to_ids : date -> set(id)
```
#### Core APIs
```python
from collections import defaultdict
class ViolationLogAnalyzer:
def __init__(self):
self.id_to_policies = defaultdict(set)
self.policy_to_ids = defaultdict(set)
self.date_to_ids = defaultdict(set)
def add(self, eid, policy, date):
# Single-event ingestion (append-only stream).
self.id_to_policies[eid].add(policy)
self.policy_to_ids[policy].add(eid)
self.date_to_ids[date].add(eid)
def build(self, log):
# Bulk build from an existing list of events.
for eid, policy, date in log:
self.add(eid, policy, date)
def query_by_id(self, eid):
# Return a defensive copy so callers can't mutate the live index.
return set(self.id_to_policies.get(eid, ()))
def query_by_policy(self, policy):
return set(self.policy_to_ids.get(policy, ()))
def query_by_date(self, date):
return set(self.date_to_ids.get(date, ()))
```
The three queries copy the internal set on the way out (`set(...)`), so a caller can never corrupt the index by mutating the result. The cost is $O(K)$ per query, which dominates the $O(1)$ hash lookup anyway. If profiling shows the copy is the bottleneck for a trusted, read-only caller, swap to returning the live set directly (and document that the result must not be mutated) or hand back a `frozenset`.
#### Complexity
Let $N$ = number of events, and let $K_x$ = size of the answer set for query $x$.
| Operation | Time | Notes |
|---|---|---|
| `add` (one event) | $O(1)$ amortized | three hash-set inserts |
| `build(log)` | $O(N)$ | one pass |
| `query_by_id` | $O(1)$ lookup + $O(K)$ copy | hash lookup, then defensive copy |
| `query_by_policy` | $O(1)$ + $O(K)$ | same |
| `query_by_date` | $O(1)$ + $O(K)$ | same |
**Subtlety to call out in the interview:** the hash lookup is $O(1)$, but returning the *internal* set by reference exposes mutable state — an accidental `result.add(...)` by the caller would silently corrupt the index. The code above defaults to a defensive copy (the $O(K)$ term), which is the safe choice. If the caller is trusted and read-only, you can return the live set directly (or a `frozenset` view) to drop the copy and make the query $O(1)$ — a deliberate latency-vs-safety trade-off worth naming aloud.
- **Space:** $O(N)$ overall. More precisely, `id_to_policies` and `policy_to_ids` together store at most one entry per distinct `(id, policy)` pair, and `date_to_ids` stores at most one entry per distinct `(date, id)` pair — both bounded by $N$ (deduped), so it's $O(N)$, not $O(N^2)$. Worst case it's roughly $2$–$3\times$ the raw log because each event contributes to (up to) three maps.
This is the right baseline answer: $O(N)$ build, $O(1)$ amortized incremental insert, near-$O(1)$ queries.
---
### Step 2 — Many repeated queries: caching & precomputation
The inverted index already makes a single query $O(1)$ to locate the answer set, so the win from caching is mostly about **avoiding the $O(K)$ copy/serialization cost** for hot keys, or memoizing **derived/composite results** that are expensive to recompute (e.g. set intersections like "ids that violated policy P *on* date D").
Strategies, cheapest first:
- **Return cached references / materialized sets.** The maps *are* materialized sets, so the *lookup* in `query_by_id` is already $O(1)$; the only residual cost is the $O(K)$ defensive copy (or whatever post-processing the caller needs). Cache the post-processed form (e.g. a sorted list, a JSON blob, or an immutable `frozenset` view) when the caller repeatedly needs that exact shape, so you skip re-copying a hot key on every call.
- **Memoize composite queries.** If hot queries are intersections/unions (policy ∧ date), keep an explicit cache keyed by the query tuple. A plain dict you control is cleaner here than `functools.lru_cache`, because writes must be able to invalidate individual entries (and `lru_cache` on an instance method would also key on `self`, which leaks across instances):
```python
def __init__(self):
...
self.composite_cache = {} # (policy, date) -> frozenset(ids)
def ids_for_policy_and_date(self, policy, date):
key = (policy, date)
if key not in self.composite_cache:
ids = self.policy_to_ids.get(policy, set()) & self.date_to_ids.get(date, set())
self.composite_cache[key] = frozenset(ids)
return self.composite_cache[key]
```
On `add(eid, policy, date)`, drop any cached entry whose policy or date just changed (see the invalidation note below) — e.g. `self.composite_cache.pop((policy, date), None)`.
- **Precompute frequency-driven materialized views.** Track query counts; for the top-$M$ hottest keys, precompute and store the finished result (sorted list, count, etc.). This trades memory for tail-latency stability.
- **Cache negative results too** (empty answers) so repeated misses on unknown ids/policies stay $O(1)$ instead of re-deriving emptiness.
#### Invalidation when new events arrive
This is the crux the interviewer is probing. The log is **append-only**, which makes invalidation tractable:
- **Targeted invalidation (preferred).** An incoming event `(id, policy, date)` can only *grow* exactly three answer sets: `id_to_policies[id]`, `policy_to_ids[policy]`, `date_to_ids[date]`. So on `add`, invalidate only cache entries keyed by that `id`, that `policy`, and that `date` (plus any composite caches touching them). Everything else stays warm. This is cheap and surgical.
- **Versioning / generation counters.** Keep a global monotonically increasing `version`. Stamp each cache entry with the version at compute time; on read, treat an entry as stale if a finer-grained per-key version has advanced. This avoids walking a large cache on every write.
- **Append-friendly property.** Because we never delete events, cached sets are *monotone* — they only ever get supersets. So for additive caches you can update in place (`cache[id].add(policy)`) instead of evicting, which is even cheaper than invalidation. Full eviction is only needed for *derived* values (counts, sorted lists, intersections) where adding one element can't be patched in O(1) for every shape.
- **Write-heavy caveat:** if writes vastly outnumber reads, skip caching entirely — the index lookups are already $O(1)$ and cache maintenance becomes pure overhead.
---
### Step 3 — If events are sorted by date: binary search
Suppose instead of (or in addition to) the hash index we keep events in an array **sorted by date**. Then `query_by_date(D)` becomes a contiguous-range problem: all events for `D` form one contiguous block, and we can find its boundaries with binary search — no full index needed for the date query, which saves the `date_to_ids` map's memory.
Use the standard lower-bound / upper-bound pair (`bisect_left` / `bisect_right`):
```python
import bisect
# events sorted ascending by date; keys = [e.date for e in events]
def query_by_date_sorted(events, keys, target_date):
lo = bisect.bisect_left(keys, target_date) # first index with date >= target
hi = bisect.bisect_right(keys, target_date) # first index with date > target
return {events[i].id for i in range(lo, hi)} # the [lo, hi) block is exactly target_date
```
- `bisect_left` returns the first position where `target_date` could be inserted keeping order → the start of the equal-date block.
- `bisect_right` returns the position just past the last equal element → the end (exclusive).
- The half-open range `[lo, hi)` is **exactly** the set of events on `target_date`.
#### Complexity
- **Locating the block:** $O(\log N)$ for the two binary searches.
- **Collecting ids:** $O(K)$ where $K = hi - lo$ is the number of events on that date.
- **Total:** $O(\log N + K)$, which is optimal — you must touch all $K$ results to return them.
- **Space:** $O(N)$ for the sorted array; **no `date_to_ids` map needed**, so this is strictly leaner if dates dominate memory.
**Practical notes:**
- ISO-8601 date strings sort lexicographically the same as chronologically (`"2026-01-31" < "2026-02-01"` as strings), so plain string comparison works as the sort key — no date parsing required. (This holds only for zero-padded, same-precision ISO dates; mixed formats break it.)
- Append-only + sorted-by-date is a happy case: if events arrive in date order (which a real-time log roughly does), appends keep the array sorted with no re-sort. If they can arrive out of order, you'd need an insertion (`O(N)` shift) or a different structure (e.g. a balanced BST / B-tree keyed by date), at which point the hash `date_to_ids` map is usually simpler and gives $O(1)$ date lookups.
- Binary search only helps `query_by_date` (the sorted key). `query_by_id` and `query_by_policy` still want their inverted hash maps; sorting by date does nothing for them.
---
### Step 4 — Trade-offs: latency vs. memory, and scaling
Let $N$ = events, $I$ = distinct ids, $P$ = distinct policies, $D$ = distinct dates.
| Design | Build | per-query | Extra memory | When to use |
|---|---|---|---|---|
| Naive scan | $O(1)$ | $O(N)$ | $O(N)$ | tiny logs / one-shot |
| Hash inverted indexes | $O(N)$ | $O(1)$+$O(K)$ | $\sim 2$–$3\times N$ | many mixed queries (default) |
| Sorted array + binary search (dates) | $O(N \log N)$ to sort | $O(\log N + K)$ date; others slow | $O(N)$ (no date map) | date queries dominate, memory-tight |
| Hybrid (hash for id/policy, sorted for date) | $O(N \log N)$ | $O(1)$ id/policy, $O(\log N + K)$ date | between the two | balanced, memory-conscious |
Key trade-offs:
- **Latency vs. memory.** The inverted indexes buy $O(1)$ queries by spending memory on three copies of the key relationships. Dropping `date_to_ids` in favor of a sorted array trades a bit of date-query latency ($O(1) \to O(\log N + K)$) for one fewer map — meaningful when $D$ is large or memory is the binding constraint.
- **Read-heavy vs. write-heavy.** Read-heavy → maximize precomputation/caching. Write-heavy (the append stream is busy) → keep the per-`add` cost minimal (the three $O(1)$ inserts) and avoid caches that need constant invalidation.
- **Result-set size $K$.** No design beats $O(K)$ to *return* a large answer; the index only removes the $O(N)$ *search*. If callers only need a **count** or **existence**, store cached cardinalities (`len`) and skip materializing the set — that's where precomputation pays off most.
Scaling behavior:
- **Number of events $N$:** memory grows linearly; query time stays flat with the hash index. The list/array form degrades linearly per query, which is why we index.
- **Distinct ids $I$ / policies $P$:** these bound the *number of keys* and the *fan-out* of each set. A policy violated by everyone gives a huge `policy_to_ids[policy]` set — $O(K)$ to return is unavoidable; consider pagination/streaming for such hot keys.
- **Beyond one machine:** if $N$ outgrows memory, partition (shard) by a key — e.g. shard `id_to_policies`/`policy_to_ids` by hash of the key, and shard `date_to_ids` by date range (which also keeps each shard naturally sorted for binary search). Old dates can be tiered to disk/columnar storage while recent dates stay hot in memory. At that scale this becomes a search-index / inverted-index system (the same shape as Lucene/Elasticsearch postings lists), and the in-memory version above is exactly the single-node core of it.
---
### Edge cases & correctness checklist
- **Unknown key:** return an empty set, not an error (`.get(key, set())`).
- **Duplicate events** `(id, policy, date)` repeated: sets dedupe automatically; counts (if needed) require a separate multiset/counter.
- **Same id, same policy, different dates:** `id_to_policies[id]` still holds `policy` once (correct — "policies violated" is a set), while `date_to_ids` records the id under each date.
- **Empty log:** all queries return empty sets; build is a no-op.
- **Mutation safety:** never hand out the live internal set unless documented read-only; prefer returning a copy/`frozenset`.
- **Time-range queries** (a likely follow-up): the sorted-by-date array generalizes for free — `bisect_left(start)` to `bisect_right(end)` gives an inclusive `[start, end]` range in $O(\log N + K)$, which the hash `date_to_ids` map cannot do without iterating every date in range.