Design a social network with snapshots
Company: OpenAI
Role: Software Engineer
Category: Software Engineering Fundamentals
Difficulty: medium
Interview Round: Technical Screen
You are asked to design and implement an in-memory `SocialNetwork` class that supports users following each other and creating **snapshots** of the follow graph. A snapshot is an **immutable view** of all follow relationships at the moment it was created; later mutations to the live network must not change any previously taken snapshot.
Implement an API similar to:
```text
sn = SocialNetwork()
sn.add_user("A")
sn.add_user("B")
sn.follow("A", "B")
snap = sn.create_snapshot()
assert snap.is_follow("A", "B")
sn.follow("A", "C") # mutates the live graph only
assert not snap.is_follow("A", "C") # the old snapshot is unaffected
```
The follow relation is **directed**: `follow(u, v)` means *u follows v* and does not imply *v follows u*. Your task is to design the underlying data structures, choose a snapshot strategy, and implement the three operations below.
### Constraints & Assumptions
- In-memory, single-process; no persistence required.
- User identifiers are hashable (treat them as strings/ints).
- `is_follow`, listing followers/followees, and recommendations all operate **on a given snapshot**, not the live graph.
- Snapshots may be taken frequently relative to the number of edges; reads on a snapshot should be cheap.
- State the asymptotic time/space cost of each operation and of `create_snapshot` itself.
### Clarifying Questions to Ask
- If `u` or `v` does not exist when calling `follow` / `is_follow` / a listing, should the method throw, auto-create the user, or return a benign default (false / empty)?
- Is self-follow (`follow(u, u)`) permitted, or rejected?
- Are repeated `follow(u, v)` calls idempotent (no duplicate edge, no inflated counts)?
- Roughly how many snapshots are expected versus how many users/edges — i.e. is the bottleneck snapshot creation, snapshot memory, or query latency?
- Is there an `unfollow` operation, and must snapshots taken before/after it reflect the right state?
### Part 1 — Snapshot follow query
Support `snap.is_follow(u, v)` returning whether user `u` follows user `v` **as of that snapshot**.
```hint Data structure
The query is membership in *u*'s set of followees. A `Map<User, Set<User>>` keyed by the follower gives $O(1)$ average lookup; decide what to return when `u` is absent from the snapshot.
```
#### What This Part Should Cover
- The core adjacency representation chosen and why it makes `is_follow` average $O(1)$.
- Clear, consistent behavior when `u` (or `v`) is not present in the snapshot.
### Part 2 — List followers and followees
For a given snapshot and user `x`, return:
- the **followers** of `x` (everyone who follows `x`), and
- the **followees** of `x` (everyone `x` follows).
```hint Avoid scanning the whole graph
Returning followers from an out-edge map alone forces an $O(V+E)$ scan. Maintaining a **second** adjacency map of *incoming* edges (`in[v] = followers of v`) makes both listings $O(\text{degree})$, at the cost of updating two maps on every `follow`.
```
#### What This Part Should Cover
- Recognizing that efficient follower listing requires either a reverse index or accepting a full-graph scan, and the time/space trade-off between them.
- Returning copies or read-only views so callers cannot mutate the snapshot.
### Part 3 — Recommend users to follow
Given a snapshot, a user `u`, and an integer `k`, recommend up to `k` users for `u` to follow using a **friends-of-friends** ranking:
1. Take `F`, the set of users `u` already follows.
2. For each `f ∈ F`, look at whom `f` follows.
3. Count, across all of `u`'s followees, how many times each candidate `c` appears.
4. Return the top-`k` candidates by that count, excluding `u` itself and anyone already in `F`.
```hint Where to start
This is a 2-hop traversal: iterate `u`'s followees, then *their* followees, accumulating a frequency count in a hash map. Decide the exclusion set (`u`, plus everyone already in `F`) before counting.
```
```hint Top-k selection
You don't need to fully sort all `m` candidates. A size-`k` min-heap keyed by count gives the top-`k` in $O(m \log k)$ vs. $O(m \log m)$ for a full sort. Define a deterministic tie-break (e.g. by user id) so results are reproducible.
```
#### What This Part Should Cover
- Correct 2-hop traversal with the right exclusions (self, already-followed) applied before/while counting.
- A top-`k` selection strategy with stated complexity and an explicit, deterministic tie-break rule.
- Counting cost expressed in terms of followee count and average out-degree, e.g. $O\!\left(\sum_{f \in F} \deg^{+}(f)\right)$.
### What a Strong Answer Covers
These dimensions span all three parts.
- A coherent **snapshot strategy** (full copy vs. copy-on-write / persistent structures vs. versioned edges) with the trade-offs that decide which fits the stated constraints, and a guarantee that snapshots are truly isolated from later mutations.
- Consistent handling of the clarified edge cases (missing users, self-follow, idempotent follows) across every method.
- Stated asymptotic time and space for `follow`, `create_snapshot`, and each snapshot query.
### Follow-up Questions
- If snapshots are taken very frequently on a large graph, full-copy snapshots become $O(V+E)$ each. How would persistent/copy-on-write data structures or versioned (append-only) edges reduce snapshot cost, and what do they cost you on the read path?
- How would you add `unfollow(u, v)` while keeping every previously taken snapshot correct?
- How would you make the live network safe under concurrent `follow`/`create_snapshot` from multiple threads?
- The recommendation only uses follow-count. How would you incorporate signals like mutual follows, recency, or excluding blocked users without changing the asymptotic cost much?
Quick Answer: This question evaluates understanding of data structures, graph representations, immutability and snapshot/versioning strategies for in-memory systems, along with algorithmic reasoning for query and recommendation operations.