Fill Missing Time-Series Values with Linear Interpolation (Duplicate Timestamps Allowed)
Company: Two Sigma
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: easy
Interview Round: Take-home Project
# Fill Missing Time-Series Values with Linear Interpolation (Duplicate Timestamps Allowed)
You are given readings from a single sensor as a list of records. Each record is a pair `[timestamp, value]` where:
- `timestamp` is an integer.
- `value` is either a floating-point number or `null` (meaning the reading is missing).
The records are **not guaranteed to be sorted**, and the **same timestamp may appear more than once** (duplicate records).
Clean the series and fill in every missing value, in the following steps:
1. **Merge duplicate timestamps.** If a timestamp appears in multiple records:
- If at least one of its records has a known (non-`null`) value, the merged value for that timestamp is the **arithmetic mean of all known values** recorded at that timestamp.
- If **all** of its records are `null`, the merged result is a single missing point at that timestamp.
2. **Sort** the merged points by timestamp in ascending order.
3. **Interpolate interior gaps.** For each missing point at time `t` that lies strictly between two known points, let `(t1, v1)` be the nearest known point with `t1 < t` and `(t2, v2)` be the nearest known point with `t2 > t`. Fill it with the linear interpolation
$$v = v_1 + (v_2 - v_1) \cdot \frac{t - t_1}{t_2 - t_1}$$
4. **Fill the boundaries by holding the nearest known value.** Any missing point that comes before the first known point takes the value of the first known point; any missing point after the last known point takes the value of the last known point.
Return the cleaned series as a list of `[timestamp, value]` pairs sorted by ascending timestamp — exactly one pair per distinct timestamp, with every value filled.
## Input
- `records`: a list of `n` pairs `[timestamp, value]`, where `value` is a float or `null`.
## Output
- A list of `[timestamp, value]` pairs, one per distinct timestamp, sorted by ascending timestamp, with all values filled as floats.
## Constraints
- `1 <= n <= 10^5`
- `-10^9 <= timestamp <= 10^9`
- Known values satisfy `-10^6 <= value <= 10^6`
- At least one record has a known (non-`null`) value.
- Answers within an absolute error of `10^-6` of the reference values are accepted.
## Example 1
Input:
```
records = [[3, null], [1, 10.0], [5, 20.0], [3, null]]
```
Output:
```
[[1, 10.0], [3, 15.0], [5, 20.0]]
```
Explanation: The two duplicate records at timestamp `3` are both missing, so they merge into a single missing point. After sorting, the missing point at `t = 3` lies between `(1, 10.0)` and `(5, 20.0)`, so it is filled with `10.0 + (20.0 - 10.0) * (3 - 1) / (5 - 1) = 15.0`.
## Example 2
Input:
```
records = [[2, 4.0], [2, 6.0], [4, null], [6, 8.0], [0, null]]
```
Output:
```
[[0, 5.0], [2, 5.0], [4, 6.5], [6, 8.0]]
```
Explanation: Timestamp `2` appears twice with known values `4.0` and `6.0`, which merge to their mean `5.0`. The missing point at `t = 0` is before the first known point, so it holds the first known value `5.0`. The missing point at `t = 4` is interpolated between `(2, 5.0)` and `(6, 8.0)`: `5.0 + (8.0 - 5.0) * (4 - 2) / (6 - 2) = 6.5`.
Quick Answer: This question evaluates a candidate's ability to perform time-series data cleaning and aggregation of duplicate timestamps, plus numerical linear interpolation and boundary value propagation, within the Coding & Algorithms domain.
You are given readings from a single sensor as a list of records. Each record is a pair `[timestamp, value]` where `timestamp` is an integer and `value` is either a floating-point number or `null` (a missing reading). The records are NOT guaranteed to be sorted, and the same timestamp may appear more than once.
Clean the series and fill in every missing value:
1. **Merge duplicate timestamps.** If a timestamp appears in multiple records: if at least one record has a known (non-null) value, the merged value is the arithmetic mean of all known values at that timestamp; if all of its records are null, it becomes a single missing point.
2. **Sort** the merged points by ascending timestamp.
3. **Interpolate interior gaps.** For a missing point at time `t` strictly between two known points `(t1, v1)` (nearest known with `t1 < t`) and `(t2, v2)` (nearest known with `t2 > t`), fill it with `v = v1 + (v2 - v1) * (t - t1) / (t2 - t1)`.
4. **Fill boundaries by holding the nearest known value.** Any missing point before the first known point takes the first known value; any missing point after the last known point takes the last known value.
Return the cleaned series as a list of `[timestamp, value]` pairs — exactly one pair per distinct timestamp, sorted by ascending timestamp, with every value filled as a float.
**Example 1**
Input: `records = [[3, null], [1, 10.0], [5, 20.0], [3, null]]`
Output: `[[1, 10.0], [3, 15.0], [5, 20.0]]`
The two null records at t=3 merge to one missing point, interpolated between (1, 10.0) and (5, 20.0): 10.0 + 10.0*(3-1)/(5-1) = 15.0.
**Example 2**
Input: `records = [[2, 4.0], [2, 6.0], [4, null], [6, 8.0], [0, null]]`
Output: `[[0, 5.0], [2, 5.0], [4, 6.5], [6, 8.0]]`
t=2 appears twice (4.0, 6.0) → mean 5.0. t=0 is before the first known point → holds 5.0. t=4 is interpolated between (2, 5.0) and (6, 8.0): 5.0 + 3.0*(4-2)/(6-2) = 6.5.
Constraints
- 1 <= n <= 10^5
- -10^9 <= timestamp <= 10^9
- -10^6 <= value <= 10^6 for known values
- At least one record has a known (non-null) value.
- Answers within an absolute error of 10^-6 of the reference are accepted.
Examples
Input: ([[3, None], [1, 10.0], [5, 20.0], [3, None]],)
Expected Output: [[1, 10.0], [3, 15.0], [5, 20.0]]
Explanation: The two null records at t=3 merge into one missing point, interpolated between (1, 10.0) and (5, 20.0): 10.0 + 10.0*(3-1)/(5-1) = 15.0.
Input: ([[2, 4.0], [2, 6.0], [4, None], [6, 8.0], [0, None]],)
Expected Output: [[0, 5.0], [2, 5.0], [4, 6.5], [6, 8.0]]
Explanation: t=2 averages 4.0 and 6.0 to 5.0. t=0 precedes the first known point so it holds 5.0. t=4 interpolates between (2, 5.0) and (6, 8.0): 5.0 + 3.0*(4-2)/(6-2) = 6.5.
Input: ([[5, 3.0]],)
Expected Output: [[5, 3.0]]
Explanation: A single known point is returned unchanged.
Input: ([[1, 10.0], [1, None], [2, None], [3, 30.0]],)
Expected Output: [[1, 10.0], [2, 20.0], [3, 30.0]]
Explanation: At t=1 the null is ignored, leaving the known value 10.0. The missing point at t=2 interpolates between (1, 10.0) and (3, 30.0): 10.0 + 20.0*(2-1)/(3-1) = 20.0.
Input: ([[0, 5.0], [1, None], [2, None]],)
Expected Output: [[0, 5.0], [1, 5.0], [2, 5.0]]
Explanation: Both missing points come after the last (and only) known point, so they hold its value 5.0.
Input: ([[2, -4.0], [0, None], [-1, None]],)
Expected Output: [[-1, -4.0], [0, -4.0], [2, -4.0]]
Explanation: After sorting, the two missing points precede the only known point (2, -4.0), so they hold -4.0.
Hints
- First group records by timestamp: accumulate a running sum and count of only the non-null values per timestamp so you can compute the mean; a timestamp with zero known values stays a missing point.
- After sorting the distinct timestamps, record the indices of the points that have a known value. Every missing point falls into exactly one of three buckets: before the first known index, after the last known index, or strictly between two consecutive known indices.
- For interior gaps, walk between each pair of consecutive known indices and apply v = v1 + (v2 - v1) * (t - t1) / (t2 - t1); for the boundary buckets simply hold the nearest known value.