Design a violation log analyzer
Company: Pinterest
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Quick Answer: This question evaluates proficiency in data structures, in-memory indexing and query design, algorithmic time/space complexity analysis, and reasoning about state management for dynamic event streams.
Part 1: Straightforward Violation Log Analyzer
Constraints
- 0 <= len(events) <= 2000
- 0 <= len(queries) <= 2000
- Each event has exactly three strings: [id, policy, date].
- Dates use a consistent ISO-8601 date format such as YYYY-MM-DD.
- Duplicate events may appear, but output values must be unique.
Examples
Input: ([['alice', 'spam', '2024-01-01'], ['bob', 'fraud', '2024-01-01'], ['alice', 'fraud', '2024-01-02'], ['alice', 'spam', '2024-01-01']], [['id', 'alice'], ['policy', 'fraud'], ['date', '2024-01-01'], ['id', 'carol']])
Expected Output: [['fraud', 'spam'], ['alice', 'bob'], ['alice', 'bob'], []]
Explanation: alice violated fraud and spam; fraud was violated by alice and bob; two ids appear on 2024-01-01; carol has no events.
Input: ([], [['id', 'x'], ['policy', 'P1'], ['date', '2024-01-01']])
Expected Output: [[], [], []]
Explanation: The empty log produces empty results for every query type.
Input: ([['u1', 'P1', '2025-12-31']], [['id', 'u1'], ['policy', 'P1'], ['date', '2025-01-01']])
Expected Output: [['P1'], ['u1'], []]
Explanation: The single event matches the id and policy queries but not the different date.
Input: ([['u3', 'B', '2024-02-02'], ['u1', 'C', '2024-02-02'], ['u2', 'A', '2024-02-03'], ['u3', 'A', '2024-02-02']], [['date', '2024-02-02'], ['id', 'u3'], ['policy', 'A']])
Expected Output: [['u1', 'u3'], ['A', 'B'], ['u2', 'u3']]
Explanation: The analyzer scans all events for each query and returns unique sorted values.
Solution
def solution(events, queries):
results = []
for query in queries:
kind, value = query
found = set()
if kind == 'id':
for eid, policy, date in events:
if eid == value:
found.add(policy)
elif kind == 'policy':
for eid, policy, date in events:
if policy == value:
found.add(eid)
elif kind == 'date':
for eid, policy, date in events:
if date == value:
found.add(eid)
else:
raise ValueError('unknown query type')
results.append(sorted(found))
return resultsTime complexity: Let N be the number of events, Q be the number of queries, and r_i be the number of distinct values returned by query i. Each query scans all events, so the total time is O(Q * N + sum(r_i log r_i)) because each result is sorted.. Space complexity: O(max r_i) auxiliary space for the set used by one query, not counting the returned output. Including output, space is O(sum r_i)..
Hints
- For a direct solution, you do not need extra data structures beyond a set for the current query.
- Think about which field to compare for each query type, and which field should be added to the answer.
Part 2: Violation Log Analyzer with Inverted Indexes
Constraints
- 0 <= len(events) <= 200000
- 0 <= len(queries) <= 200000
- Each event has exactly three strings: [id, policy, date].
- Dates use a consistent ISO-8601 date format such as YYYY-MM-DD.
- The total size of all returned lists is bounded by available memory.
Examples
Input: ([['u1', 'A', '2024-05-01'], ['u2', 'A', '2024-05-01'], ['u1', 'B', '2024-05-02'], ['u3', 'A', '2024-05-03']], [['policy', 'A'], ['id', 'u1'], ['date', '2024-05-01'], ['policy', 'Z']])
Expected Output: [['u1', 'u2', 'u3'], ['A', 'B'], ['u1', 'u2'], []]
Explanation: The policy, id, and date indexes answer the queries directly. Missing policy Z returns an empty list.
Input: ([], [['id', 'missing'], ['policy', 'missing'], ['date', '2024-01-01']])
Expected Output: [[], [], []]
Explanation: All indexes are empty.
Input: ([['u1', 'P', '2024-01-01'], ['u1', 'P', '2024-01-01'], ['u1', 'Q', '2024-01-01']], [['id', 'u1'], ['policy', 'P'], ['date', '2024-01-01']])
Expected Output: [['P', 'Q'], ['u1'], ['u1']]
Explanation: Duplicate events do not create duplicate output values because index values are sets.
Input: ([['solo', 'Only', '2024-12-31']], [['policy', 'Only'], ['date', '2024-12-30'], ['id', 'solo']])
Expected Output: [['solo'], [], ['Only']]
Explanation: A single event is indexed in all three inverted indexes.
Solution
def solution(events, queries):
id_to_policies = {}
policy_to_ids = {}
date_to_ids = {}
for event in events:
eid, policy, date = event
if eid not in id_to_policies:
id_to_policies[eid] = set()
id_to_policies[eid].add(policy)
if policy not in policy_to_ids:
policy_to_ids[policy] = set()
policy_to_ids[policy].add(eid)
if date not in date_to_ids:
date_to_ids[date] = set()
date_to_ids[date].add(eid)
results = []
for query in queries:
kind, value = query
if kind == 'id':
results.append(sorted(id_to_policies.get(value, set())))
elif kind == 'policy':
results.append(sorted(policy_to_ids.get(value, set())))
elif kind == 'date':
results.append(sorted(date_to_ids.get(value, set())))
else:
raise ValueError('unknown query type')
return resultsTime complexity: Let N be the number of events and r_i be the number of distinct values returned by query i. Building the indexes takes O(N) average time. Answering all queries takes O(Q + sum(r_i log r_i)) because each lookup is O(1) average and each returned set is sorted.. Space complexity: O(N) for the three inverted indexes in the worst case, plus O(sum r_i) for the returned output..
Hints
- Build dictionaries whose values are sets so duplicates are removed automatically.
- After the index is built, each query should avoid scanning the entire event list.
Part 3: Cached Violation Log Analyzer with Append Invalidation
Constraints
- 0 <= len(events) <= 200000
- 0 <= len(operations) <= 200000
- Each initial event and append operation contains exactly one id, one policy, and one date.
- Dates use a consistent ISO-8601 date format such as YYYY-MM-DD.
- Duplicate events may be appended; outputs must still contain unique values.
Examples
Input: ([['u1', 'P1', '2024-01-01'], ['u2', 'P1', '2024-01-01']], [['query_id', 'u1'], ['query_id', 'u1'], ['append', 'u1', 'P2', '2024-01-02'], ['query_id', 'u1'], ['query_policy', 'P2'], ['query_date', '2024-01-02']])
Expected Output: [['P1'], ['P1'], ['P1', 'P2'], ['u1'], ['u1']]
Explanation: The repeated first query can be cached. After appending a new event for u1, the cached query_id result for u1 must be invalidated.
Input: ([], [['query_id', 'x'], ['append', 'x', 'A', '2025-01-01'], ['query_id', 'x'], ['query_policy', 'A'], ['query_date', '2025-01-01']])
Expected Output: [[], ['A'], ['x'], ['x']]
Explanation: A missing id initially returns empty. After append, the affected id, policy, and date queries return the new data.
Input: ([['a', 'P', '2024-01-01']], [['query_policy', 'P'], ['query_date', '2024-01-01'], ['append', 'b', 'P', '2024-01-02'], ['query_policy', 'P'], ['query_date', '2024-01-01'], ['query_date', '2024-01-02']])
Expected Output: [['a'], ['a'], ['a', 'b'], ['a'], ['b']]
Explanation: Appending b to policy P invalidates policy P and date 2024-01-02, but not the cached result for date 2024-01-01.
Input: ([], [['append', 'x', 'P', '2024-03-03'], ['query_id', 'x'], ['append', 'x', 'P', '2024-03-03'], ['query_id', 'x'], ['query_policy', 'P'], ['query_date', '2024-03-03']])
Expected Output: [['P'], ['P'], ['x'], ['x']]
Explanation: Appending a duplicate event does not duplicate values in the output.
Solution
def solution(events, operations):
id_to_policies = {}
policy_to_ids = {}
date_to_ids = {}
def add_event(eid, policy, date):
if eid not in id_to_policies:
id_to_policies[eid] = set()
id_to_policies[eid].add(policy)
if policy not in policy_to_ids:
policy_to_ids[policy] = set()
policy_to_ids[policy].add(eid)
if date not in date_to_ids:
date_to_ids[date] = set()
date_to_ids[date].add(eid)
for event in events:
add_event(event[0], event[1], event[2])
id_cache = {}
policy_cache = {}
date_cache = {}
results = []
for op in operations:
kind = op[0]
if kind == 'append':
eid, policy, date = op[1], op[2], op[3]
add_event(eid, policy, date)
id_cache.pop(eid, None)
policy_cache.pop(policy, None)
date_cache.pop(date, None)
elif kind == 'query_id':
eid = op[1]
if eid not in id_cache:
id_cache[eid] = sorted(id_to_policies.get(eid, set()))
results.append(list(id_cache[eid]))
elif kind == 'query_policy':
policy = op[1]
if policy not in policy_cache:
policy_cache[policy] = sorted(policy_to_ids.get(policy, set()))
results.append(list(policy_cache[policy]))
elif kind == 'query_date':
date = op[1]
if date not in date_cache:
date_cache[date] = sorted(date_to_ids.get(date, set()))
results.append(list(date_cache[date]))
else:
raise ValueError('unknown operation type')
return resultsTime complexity: Let N be the initial number of events. Building indexes takes O(N). Each append is O(1) average time for index updates and O(1) average time for cache invalidation. A cache-miss query returning r values costs O(r log r) to sort. A cache-hit query costs O(r) here because the cached list is copied into the output.. Space complexity: O(N + A) for the materialized indexes after A appended events, plus O(C) for cached sorted answers, where C is the total size of cached result lists. Returned output also uses O(total returned values)..
Hints
- Use sets in the indexes as the source of truth, and cache only the sorted list representation.
- An appended event with id x, policy p, and date d can only change cached answers for query_id x, query_policy p, and query_date d.
Part 4: Binary Search Date Lookup in a Sorted Violation Log
Constraints
- 0 <= len(events) <= 200000
- 0 <= len(dates) <= 200000
- events is sorted in nondecreasing order by the date field.
- Dates use a consistent ISO-8601 date format such as YYYY-MM-DD, so lexicographic comparison matches chronological order.
- Duplicate ids may appear multiple times on the same date, but each output list must contain unique ids.
Examples
Input: ([['u1', 'P1', '2024-01-01'], ['u2', 'P2', '2024-01-01'], ['u1', 'P3', '2024-01-02'], ['u3', 'P1', '2024-01-04']], ['2024-01-01', '2024-01-03', '2024-01-04'])
Expected Output: [['u1', 'u2'], [], ['u3']]
Explanation: Binary search locates the date ranges for the first and last queries. 2024-01-03 is absent.
Input: ([], ['2024-01-01'])
Expected Output: [[]]
Explanation: An empty sorted log has no ids for any date.
Input: ([['a', 'X', '2024-12-31']], ['2024-12-30', '2024-12-31'])
Expected Output: [[], ['a']]
Explanation: The target before the only event is absent; the exact date matches.
Input: ([['b', 'P', '2024-01-01'], ['a', 'Q', '2024-01-02'], ['a', 'R', '2024-01-02'], ['c', 'Q', '2024-01-02'], ['d', 'P', '2024-01-05']], ['2024-01-02', '2024-01-05'])
Expected Output: [['a', 'c'], ['d']]
Explanation: Multiple events for id a on 2024-01-02 appear only once in the result.
Input: ([['a', 'P', '2024-02-10'], ['b', 'Q', '2024-02-11']], ['2024-01-01', '2024-12-31'])
Expected Output: [[], []]
Explanation: Dates outside the sorted log boundaries return empty lists.
Solution
def solution(events, dates):
n = len(events)
def lower_bound(target):
lo, hi = 0, n
while lo < hi:
mid = (lo + hi) // 2
if events[mid][2] < target:
lo = mid + 1
else:
hi = mid
return lo
def upper_bound(target):
lo, hi = 0, n
while lo < hi:
mid = (lo + hi) // 2
if events[mid][2] <= target:
lo = mid + 1
else:
hi = mid
return lo
results = []
for target in dates:
left = lower_bound(target)
right = upper_bound(target)
ids = set()
for i in range(left, right):
ids.add(events[i][0])
results.append(sorted(ids))
return resultsTime complexity: Let N be the number of events, Q be the number of target dates, k_i be the number of events matching query i, and r_i be the number of distinct ids returned by query i. Each query costs O(log N + k_i + r_i log r_i). Total time is O(Q log N + sum k_i + sum(r_i log r_i)).. Space complexity: O(max r_i) auxiliary space for the set of ids for one date, not counting returned output. Including output, space is O(sum r_i)..
Hints
- Use one binary search to find the first event whose date is not less than the target.
- Use another binary search to find the first event whose date is greater than the target; the answer lies between those two positions.