Median, Variance, and Linear Trend of a Daily Temperature Dataset
Company: Two Sigma
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: easy
Interview Round: Take-home Project
# Median, Variance, and Linear Trend of a Daily Temperature Dataset
You are given `n` daily temperature readings collected in New York City. The data is a list of records `[day, temp]` where:
- `day` is an integer day index. All day values are **distinct**, but the list is **not necessarily sorted**.
- `temp` is the temperature reading for that day, a floating-point number.
You are also given an integer `q` — a query day index, which may lie outside the observed range of days.
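Compute the following four quantities and return them in order. The regression in step 3 contributes two values (slope, then intercept), so the result holds five numbers in total: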
1. **Median temperature.** Sort the temperatures. If `n` is odd, the median is the middle value; if `n` is even, it is the arithmetic mean of the two middle values.
2. **Sample variance of the temperatures**, using the `n - 1` denominator:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$$
where $y_i$ are the temperatures and $\bar{y}$ is their mean. (`n >= 2` is guaranteed.)
3. **Ordinary least-squares simple linear regression** of temperature on day index — the slope `b` and intercept `a` of the line `temp = a + b * day` that minimizes the sum of squared residuals:
$$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x}$$
where $x_i$ are the day indices. Because all day values are distinct and `n >= 2`, the denominator is never zero.
4. **Predicted temperature for the query day**: $\hat{y} = a + b \cdot q$.
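The four steps above can be sketched directly in Python. This is a minimal illustration, not a reference solution: the function name `temperature_stats` and its signature are assumptions made for this example, and it uses the centered (two-pass) form of the formulas exactly as written in the statement.

```python
from typing import List, Sequence


def temperature_stats(records: Sequence[Sequence[float]], q: int) -> List[float]:
    """Return [median, sample_variance, slope, intercept, prediction].

    `temperature_stats` is a hypothetical name chosen for this sketch.
    """
    n = len(records)
    days = [float(r[0]) for r in records]
    temps = [float(r[1]) for r in records]

    # 1. Median: sort a copy of the temperatures; day order is irrelevant here.
    ordered = sorted(temps)
    mid = n // 2
    if n % 2 == 1:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2.0

    # 2. Sample variance with the n - 1 denominator.
    y_bar = sum(temps) / n
    variance = sum((y - y_bar) ** 2 for y in temps) / (n - 1)

    # 3. OLS slope and intercept from centered sums.
    x_bar = sum(days) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(days, temps))
    s_xx = sum((x - x_bar) ** 2 for x in days)  # > 0 because days are distinct
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar

    # 4. Point prediction at the query day (may extrapolate beyond the data).
    prediction = intercept + slope * q

    return [median, variance, slope, intercept, prediction]
```

The sort dominates the running time at O(n log n); everything else is a constant number of linear passes. Centering the data before summing, rather than using raw sums of squares, tends to lose less precision when day indices are large.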
## Input
- `records`: a list of `n` pairs `[day, temp]` with distinct integer `day` values and float `temp` values.
- `q`: an integer query day index.
## Output
- A list of five floats: `[median, sample_variance, slope, intercept, prediction]`.
## Constraints
- `2 <= n <= 10^5`
- `0 <= day <= 10^6`, all `day` values distinct
- `-100.0 <= temp <= 150.0`
- `0 <= q <= 2 * 10^6`
- Answers within an absolute error of `10^-4` of the reference values are accepted.
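The absolute-error rule can be checked locally with a small helper. The grader's actual comparison is not shown in the problem, so `within_tolerance` below is a hypothetical function that only mirrors the stated `10^-4` absolute tolerance.

```python
import math
from typing import Sequence


def within_tolerance(got: Sequence[float],
                     expected: Sequence[float],
                     tol: float = 1e-4) -> bool:
    """Return True if every value is within `tol` absolute error.

    Hypothetical helper: it approximates the acceptance rule in the
    constraints, not the judge's real implementation.
    """
    if len(got) != len(expected):
        return False
    # rel_tol=0.0 makes math.isclose a purely absolute comparison.
    return all(
        math.isclose(g, e, rel_tol=0.0, abs_tol=tol)
        for g, e in zip(got, expected)
    )
```

Under this rule, a rounded value such as `26.666667` is accepted against the full-precision `26.666666666666668`, so printing six decimal places is sufficient.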
## Example 1
Input:
```
records = [[0, 30.0], [1, 34.0], [2, 38.0], [3, 42.0]]
q = 5
```
Output:
```
[36.0, 26.666667, 4.0, 30.0, 50.0]
```
Explanation: Sorted temperatures are `[30, 34, 38, 42]`, so the median is `(34 + 38) / 2 = 36.0`. The mean is `36.0`, and the sample variance is `(36 + 4 + 4 + 36) / 3 = 26.666667`. The best-fit line is `temp = 30.0 + 4.0 * day`, so the prediction for day `5` is `50.0`.
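The explanation states the best-fit line without showing the arithmetic. Using the slope and intercept formulas from the problem statement, with $\bar{x} = 1.5$ and $\bar{y} = 36$, the computation works out as follows:

$$\sum_{i=1}^{4} (x_i - \bar{x})(y_i - \bar{y}) = (-1.5)(-6) + (-0.5)(-2) + (0.5)(2) + (1.5)(6) = 20$$

$$\sum_{i=1}^{4} (x_i - \bar{x})^2 = 2.25 + 0.25 + 0.25 + 2.25 = 5$$

$$b = \frac{20}{5} = 4, \qquad a = 36 - 4 \cdot 1.5 = 30, \qquad \hat{y} = 30 + 4 \cdot 5 = 50$$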
## Example 2
Input:
```
records = [[2, 50.0], [0, 54.0]]
q = 1
```
Output:
```
[52.0, 8.0, -2.0, 54.0, 52.0]
```
Explanation: The median of `[50, 54]` is `52.0` and the sample variance is `((54 - 52)^2 + (50 - 52)^2) / 1 = 8.0`. The regression line through the two points is `temp = 54.0 - 2.0 * day`, giving a prediction of `52.0` for day `1`.
Quick Answer: This question evaluates competency in statistical data processing and numerical computation, focusing on median and sample variance estimation, ordinary least-squares linear regression (slope and intercept), and point prediction from unsorted time-indexed temperature readings.
## Examples

**Case 1**

- Input: `([[0, 30.0], [1, 34.0], [2, 38.0], [3, 42.0]], 5)`
- Expected Output: `[36.0, 26.666666666666668, 4.0, 30.0, 50.0]`
- Explanation: Sorted temps `[30, 34, 38, 42]` give median `(34 + 38) / 2 = 36.0`. The mean is `36.0`, so the variance is `(36 + 4 + 4 + 36) / 3 = 26.6667`. The best-fit line is `temp = 30 + 4 * day`, so the prediction at day `5` is `50.0`.

**Case 2**

- Input: `([[2, 50.0], [0, 54.0]], 1)`
- Expected Output: `[52.0, 8.0, -2.0, 54.0, 52.0]`
- Explanation: The median of `[50, 54]` is `52.0`; the variance is `((54 - 52)^2 + (50 - 52)^2) / 1 = 8.0`. The line through the two points is `temp = 54 - 2 * day`, giving `52.0` at day `1`. Note that the list is unsorted by day.

**Case 3**

- Input: `([[5, 10.0], [1, 2.0], [3, 6.0]], 10)`
- Expected Output: `[6.0, 16.0, 2.0, 0.0, 20.0]`
- Explanation: Odd `n = 3`: sorted temps `[2, 6, 10]` give median `6.0`. The mean is `6.0`, so the variance is `(16 + 0 + 16) / 2 = 16.0`. The points lie exactly on `temp = 2 * day`, so the slope is `2`, the intercept is `0`, and the prediction at day `10` is `20.0`.

**Case 4**

- Input: `([[0, -40.0], [4, -20.0], [2, -25.0], [6, -10.0]], 8)`
- Expected Output: `[-22.5, 156.25, 4.75, -38.0, 0.0]`
- Explanation: Negative temps with unsorted days. Sorted temps `[-40, -25, -20, -10]` give median `(-25 + -20) / 2 = -22.5`. The mean is `-23.75`, so the variance is `468.75 / 3 = 156.25`. OLS gives slope `4.75` and intercept `-38.0`, so the prediction at day `8` is `0.0`.

**Case 5**

- Input: `([[100, 20.0], [50, 20.0]], 999)`
- Expected Output: `[20.0, 0.0, 0.0, 20.0, 20.0]`
- Explanation: The minimum size `n = 2` with equal temps: median `20.0`, variance `0.0`. Identical temps make the best-fit line horizontal (slope `0`, intercept `20`), so the prediction for any query day, here `999`, is `20.0`.
## Hints
- Median only needs the sorted temperatures — the day indices don't matter for it. Sorting a copy of the temps (O(n log n)) is enough; a linear-time selection is optional.
- Sample variance uses the n-1 denominator (Bessel's correction), not n. Compute the temperature mean first, then sum the squared deviations.
- For the OLS slope, compute the means xbar (of days) and ybar (of temps) in one pass, then b = sum((x-xbar)(y-ybar)) / sum((x-xbar)^2). The intercept is a = ybar - b*xbar, and the prediction is simply a + b*q.
- The regression uses day as the independent variable x and temp as the dependent variable y — don't swap them. The median/variance use only y (temps).
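The median and variance hints can be cross-checked against Python's standard `statistics` module. This is a sketch for verification only, assuming library calls are permitted in your setting; the variable names are illustrative, and the data is the temperature column from the first example.

```python
import statistics

# Temperatures from the first example; day indices are not needed here.
temps = [30.0, 34.0, 38.0, 42.0]

# statistics.median averages the two middle values when n is even.
median = statistics.median(temps)

# statistics.variance uses the n - 1 denominator (Bessel's correction),
# which is what the problem asks for.
sample_var = statistics.variance(temps)

# statistics.pvariance uses the n denominator and would be the wrong choice.
population_var = statistics.pvariance(temps)

print(median, sample_var, population_var)
```

For this data the sample variance is `80 / 3` while the population variance is `80 / 4 = 20.0`, a gap far larger than the accepted tolerance, so mixing up the two denominators fails the check.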