Find returning users from access logs
Company: Amazon
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
Quick Answer: This question evaluates skills in time-series data processing, timestamp parsing and time-zone normalization, scalable stream or batch handling for logs that exceed memory, algorithmic complexity analysis, and test-case design for edge conditions such as midnight boundaries and duplicate events.
Constraints
- 0 <= len(events) <= 200000
- events[i] = [user_id, timestamp_iso8601]
- user_id is a non-empty string of length <= 64
- timestamp_iso8601 format: YYYY-MM-DDTHH:MM:SS[.ffffff](Z|±HH:MM)
- Different days are determined after converting timestamps to UTC
- Duplicate events do not increase day counts
- Return user_ids sorted lexicographically
Solution
from datetime import datetime, timezone
def returning_users(events: list[list[str]]) -> list[str]:
"""Return lexicographically sorted user_ids with visits on >= 2 different UTC dates."""
returning = set()
seen_dates = {} # user_id -> set of seen UTC dates (we only keep until 2nd unique date)
def to_utc_date(ts: str):
# Normalize 'Z' to '+00:00' for fromisoformat compatibility
if ts.endswith('Z'):
ts = ts[:-1] + '+00:00'
dt = datetime.fromisoformat(ts)
dt_utc = dt.astimezone(timezone.utc)
return dt_utc.date()
for user_id, ts in events:
if user_id in returning:
# Already classified as returning; no need to track more
continue
d = to_utc_date(ts)
prev = seen_dates.get(user_id)
if prev is None:
seen_dates[user_id] = {d}
elif d not in prev:
# Second unique day observed -> returning user
returning.add(user_id)
# Free memory for this user; no further tracking needed
del seen_dates[user_id]
# else: duplicate day for this user; ignore
return sorted(returning)
Explanation
Time complexity: O(n + r log r), where n is the number of events and r is the number of returning users (sorting cost).. Space complexity: O(u), where u is the number of users not yet classified as returning..
Hints
- Normalize timestamps to UTC and then take the calendar date (yyyy-mm-dd).
- In Python, replace a trailing 'Z' with '+00:00' and use datetime.fromisoformat, then astimezone(timezone.utc).date().
- Track per-user unique UTC dates; you only need to know whether a second, different date has been seen.
- Use a set for the resulting user_ids and sort before returning.
- Process events in a streaming manner to handle large logs without storing them all.