Merge Two Insertion Diffs in Linear Time
Company: Jane Street
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Onsite
A text-editing system represents changes to a document as **insertion diffs**. An insertion diff is a set of `(position, value)` pairs, where each pair means: "after this diff is applied, the element `value` sits at index `position` of the resulting document." All elements of the pre-diff document keep their relative order and fill the remaining indices from left to right.
Formally, applying a diff `D` to a sequence `S` produces a sequence `T` of length `len(S) + len(D)`:
- For every pair `(p, v)` in `D`, `T[p] = v`.
- The elements of `S`, in their original order, occupy the remaining indices of `T` in increasing index order.
For example, applying `{0: "A", 3: "B"}` to `["z", "z", "z"]` yields `["A", "z", "z", "B", "z"]`.
Now suppose two diffs are applied one after the other:
1. `diff_one` is applied to the original document.
2. `diff_two` is then applied to the result of step 1 — so its positions refer to indices of the **final** document.
Continuing the example with the original document `["z", "z", "z"]`:
- `diff_one = {0: "A", 3: "B"}` produces `["A", "z", "z", "B", "z"]`.
- `diff_two = {0: "C", 4: "D"}` then produces `["C", "A", "z", "z", "D", "B", "z"]`.
Write a function `merge_diffs(diff_one, diff_two)` that returns a **single** diff equivalent to applying `diff_one` followed by `diff_two`. In the example above, the merged diff is `{0: "C", 1: "A", 4: "D", 5: "B"}`: applying it directly to `["z", "z", "z"]` yields `["C", "A", "z", "z", "D", "B", "z"]` in one step.
The document itself is never given and may be enormous, so you cannot materialize it. Your function must run in $O(n + m)$ time, where $n$ is the number of insertions in `diff_one` and $m$ is the number of insertions in `diff_two` — not in time proportional to the document length or to the largest position value.
**Input format**
- Each diff is given as a list of `(position, value)` pairs sorted by increasing `position` (equivalently, a mapping whose keys iterate in increasing order).
- Positions within a single diff are distinct non-negative integers.
- Values are non-empty strings and should be treated as opaque.
- You may assume both diffs are valid for the (unseen) document they are applied to: every position is a valid index of the corresponding post-application document.
**Output format**
Return the merged diff as a list of `(position, value)` pairs sorted by increasing `position`. Its size is exactly `n + m`.
**Examples**
Example 1:
- Input: `diff_one = [(0, "A"), (3, "B")]`, `diff_two = [(0, "C"), (4, "D")]`
- Output: `[(0, "C"), (1, "A"), (4, "D"), (5, "B")]`
Example 2 (position collision between the two diffs):
- Input: `diff_one = [(1, "P")]`, `diff_two = [(1, "Q")]`
- Output: `[(1, "Q"), (2, "P")]`
- Explanation: on an original document `["a", "b"]`, `diff_one` produces `["a", "P", "b"]`, and `diff_two` then produces `["a", "Q", "P", "b"]`. `diff_two`'s position is absolute in the final document, so `"Q"` occupies index 1 and `"P"` is pushed to index 2.
Example 3 (one empty diff):
- Input: `diff_one = []`, `diff_two = [(2, "X")]`
- Output: `[(2, "X")]`
**Constraints**
- $0 \le n, m \le 2 \cdot 10^5$
- $0 \le \text{position} \le 10^9$
- Required time complexity: $O(n + m)$
Quick Answer: This question evaluates algorithm design and sequence-manipulation skills, focusing on position-mapped insertions, order-preserving merging, and attention to time/space complexity.
A text-editing system represents changes to a document as **insertion diffs**. An insertion diff is a set of `(position, value)` pairs, where each pair means: after this diff is applied, the element `value` sits at index `position` of the resulting document. All elements of the pre-diff document keep their relative order and fill the remaining indices from left to right.
Formally, applying a diff `D` to a sequence `S` produces a sequence `T` of length `len(S) + len(D)`:
- For every pair `(p, v)` in `D`, `T[p] = v`.
- The elements of `S`, in their original order, occupy the remaining indices of `T` in increasing index order.
For example, applying `{0: "A", 3: "B"}` to `["z", "z", "z"]` yields `["A", "z", "z", "B", "z"]`.
Now suppose two diffs are applied one after the other:
1. `diff_one` is applied to the original document.
2. `diff_two` is then applied to the result of step 1 — so its positions refer to indices of the **final** document.
Continuing with the original document `["z", "z", "z"]`: `diff_one = {0: "A", 3: "B"}` produces `["A", "z", "z", "B", "z"]`, and `diff_two = {0: "C", 4: "D"}` then produces `["C", "A", "z", "z", "D", "B", "z"]`.
Write `merge_diffs(diff_one, diff_two)` returning a **single** diff equivalent to applying `diff_one` followed by `diff_two`. For the example, the merged diff is `[(0, "C"), (1, "A"), (4, "D"), (5, "B")]`: applying it directly to `["z", "z", "z"]` yields the same final document in one step.
The document itself is never given and may be enormous, so you cannot materialize it. Your function must run in **O(n + m)** time, where `n = len(diff_one)` and `m = len(diff_two)` — not proportional to the document length or to the largest position value.
**Input**: each diff is a list of `(position, value)` pairs sorted by increasing `position`; positions within one diff are distinct non-negative integers; values are opaque non-empty strings. Both diffs are valid for their (unseen) document.
**Output**: the merged diff as a list of `(position, value)` pairs sorted by increasing `position`; its size is exactly `n + m`.
Constraints
- 0 <= n, m <= 2 * 10^5 (n = len(diff_one), m = len(diff_two))
- 0 <= position <= 10^9
- Positions within a single diff are distinct; each diff is sorted by increasing position
- Values are opaque non-empty strings
- Required time complexity: O(n + m) — you may not materialize the document
Examples
Input: ([(0, 'A'), (3, 'B')], [(0, 'C'), (4, 'D')])
Expected Output: [(0, 'C'), (1, 'A'), (4, 'D'), (5, 'B')]
Explanation: The worked example. C takes final index 0, pushing A (was at T1 index 0) to 1; D takes index 4; B (was at T1 index 3) lands at 3 + 2 = 5.
Input: ([(1, 'P')], [(1, 'Q')])
Expected Output: [(1, 'Q'), (2, 'P')]
Explanation: Position collision. On doc ['a','b'], diff_one gives ['a','P','b']; diff_two's index 1 is absolute, so Q takes index 1 and P is pushed to index 2 (1 + d, d=1).
Input: ([], [(2, 'X')])
Expected Output: [(2, 'X')]
Explanation: diff_one is empty, so the merged diff is just diff_two unchanged.
Input: ([], [])
Expected Output: []
Explanation: Both diffs empty — nothing to insert.
Input: ([(0, 'A'), (3, 'B')], [])
Expected Output: [(0, 'A'), (3, 'B')]
Explanation: diff_two is empty (d stays 0), so every diff_one entry keeps its original position.
Input: ([(1, 'A'), (4, 'B')], [(0, 'C'), (3, 'D'), (7, 'E')])
Expected Output: [(0, 'C'), (2, 'A'), (3, 'D'), (6, 'B'), (7, 'E')]
Explanation: Ground-truth check on a len-4 original document. Intermediate T1 = [z,A,z,z,B,z]; applying diff_two yields final [C,z,A,D,z,z,B,E,z]. So A moves from T1 index 1 to 1+1=2, B moves from T1 index 4 to 4+2=6, and C,D,E sit at 0,3,7.
Hints
- Do not build the document. Its length is unknown and can be far larger than n + m.
- Think about where each diff_one insertion ends up in the FINAL document. The diff_two entries are the only things that shift it, and each shift is exactly +1.
- Track d = how many diff_two entries have been emitted so far. A diff_one entry originally at T1 index p lands at final index p + d.
- Merge the two sorted streams: at each step compare the diff_two position q with p + d. If q <= p + d, emit the diff_two entry (and increment d); otherwise emit the diff_one entry at p + d. Handle ties (q == p + d) in favor of diff_two — its slot is reserved.