Filter Invalid Data Events
Company: Atlassian
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: easy
Interview Round: Technical Screen
Quick Answer: This question evaluates a candidate's ability to implement data validation, deduplication, stable filtering, time-range and allow-list checks, and to reason about time and space complexity for both batch and streaming scenarios.
Part 1: Filter Invalid Data Events from a Batch
Constraints
- 0 <= len(events) <= 200000
- 0 <= len(allowed_event_types) <= 10000
- start_time <= end_time
- An event_time is considered valid only if it is an integer Unix timestamp; booleans should be treated as malformed
- Missing keys count as missing fields
Examples
Input: ([{'event_id': 'e1', 'creator_id': 'u1', 'event_type': 'click', 'event_time': 10, 'payload': {'x': 1}}, {'event_id': '', 'creator_id': 'u2', 'event_type': 'click', 'event_time': 11, 'payload': None}, {'event_id': 'e2', 'creator_id': 'u3', 'event_type': 'purchase', 'event_time': 15, 'payload': {}}, {'event_id': 'e1', 'creator_id': 'u4', 'event_type': 'click', 'event_time': 12, 'payload': 'dup'}, {'event_id': 'e3', 'creator_id': 'u5', 'event_type': 'view', 'event_time': 21, 'payload': 0}, {'event_id': 'e4', 'creator_id': 'u6', 'event_type': 'view', 'event_time': 20, 'payload': []}], ['click', 'view'], 10, 20)
Expected Output: [{'event_id': 'e1', 'creator_id': 'u1', 'event_type': 'click', 'event_time': 10, 'payload': {'x': 1}}, {'event_id': 'e4', 'creator_id': 'u6', 'event_type': 'view', 'event_time': 20, 'payload': []}]
Explanation: The empty event_id, disallowed event_type, duplicate valid event_id, and out-of-range timestamp are filtered out.
Input: ([{'event_id': 'z1', 'creator_id': 'u1', 'event_type': 'click', 'event_time': '100', 'payload': None}, {'event_id': 'z1', 'creator_id': 'u1', 'event_type': 'click', 'event_time': 100, 'payload': {'ok': True}}], ['click'], 0, 200)
Expected Output: [{'event_id': 'z1', 'creator_id': 'u1', 'event_type': 'click', 'event_time': 100, 'payload': {'ok': True}}]
Explanation: The first event is malformed because event_time is a string, so it does not block the later valid event with the same event_id.
Input: ([], ['click', 'view'], 0, 10)
Expected Output: []
Explanation: Edge case: empty input should return an empty list.
Input: ([{'event_id': 'a', 'creator_id': 'u', 'event_type': 'view', 'event_time': 5, 'payload': None}, {'event_id': 'b', 'creator_id': 'u', 'event_type': 'view', 'event_time': True, 'payload': None}, {'event_id': 'c', 'creator_id': None, 'event_type': 'view', 'event_time': 10, 'payload': None}, {'event_id': 'd', 'creator_id': 'u', 'event_type': 'view', 'event_time': 15, 'payload': None}], ['view'], 5, 15)
Expected Output: [{'event_id': 'a', 'creator_id': 'u', 'event_type': 'view', 'event_time': 5, 'payload': None}, {'event_id': 'd', 'creator_id': 'u', 'event_type': 'view', 'event_time': 15, 'payload': None}]
Explanation: Boundary timestamps 5 and 15 are valid. True is treated as malformed, and creator_id=None is invalid.
Hints
- Use a hash set for the allowed event types so membership checks are fast.
- Only add an event_id to your seen set after the event passes every other validation rule.
Part 2: Streaming Event Acceptance Decisions
Constraints
- 0 <= len(events) <= 200000
- 0 <= len(allowed_event_types) <= 10000
- start_time <= end_time
- An event_time is considered valid only if it is an integer Unix timestamp; booleans should be treated as malformed
- For exact duplicate detection in an unbounded real stream, the set of accepted event_ids can grow without bound
Examples
Input: ([{'event_id': 'a', 'creator_id': 'u1', 'event_type': 'click', 'event_time': 5, 'payload': None}, {'event_id': 'b', 'creator_id': '', 'event_type': 'click', 'event_time': 6, 'payload': None}, {'event_id': 'a', 'creator_id': 'u2', 'event_type': 'click', 'event_time': 7, 'payload': None}, {'event_id': 'c', 'creator_id': 'u3', 'event_type': 'view', 'event_time': 11, 'payload': None}, {'event_id': 'd', 'creator_id': 'u4', 'event_type': 'view', 'event_time': 10, 'payload': None}], ['click', 'view'], 5, 10)
Expected Output: [True, False, False, False, True]
Explanation: The first and last events are accepted. The others fail due to empty creator_id, duplicate accepted event_id, and out-of-range timestamp.
Input: ([{'event_id': 'z1', 'creator_id': 'u1', 'event_type': 'click', 'event_time': None, 'payload': None}, {'event_id': 'z1', 'creator_id': 'u1', 'event_type': 'click', 'event_time': 100, 'payload': {'ok': True}}], ['click'], 0, 200)
Expected Output: [False, True]
Explanation: An invalid event does not reserve its event_id, so the later valid event is accepted.
Input: ([], ['view'], 0, 10)
Expected Output: []
Explanation: Edge case: no arrivals means no decisions.
Input: ([{'event_id': 'm1', 'creator_id': 'u', 'event_type': 'view', 'event_time': 1, 'payload': None}, {'event_id': 'm2', 'creator_id': 'u', 'event_type': 'view', 'event_time': True, 'payload': None}, {'event_id': 'm3', 'creator_id': None, 'event_type': 'view', 'event_time': 2, 'payload': None}, {'event_id': 'm4', 'creator_id': 'u', 'event_type': 'view', 'event_time': 3, 'payload': None}], ['view'], 1, 3)
Expected Output: [True, False, False, True]
Explanation: Boundary timestamps are accepted, but True is malformed as a timestamp and creator_id=None is invalid.
Hints
- Think about what state must survive between arrivals. You do not need to store all past events.
- A hash set of accepted event_ids is enough for exact duplicate checks, but only add an ID after accepting the event.