PracHub
QuestionsPremiumLearningGuidesCheatsheetNEW

Quick Overview

This question evaluates proficiency in streaming data transformation and time-series alignment, including handling sparse-to-dense mapping, ordering guarantees, and memory/latency trade-offs.

  • easy
  • Jane Street
  • Coding & Algorithms
  • Software Engineer

Transform sparse time-code stream to dense rows

Company: Jane Street

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: easy

Interview Round: Technical Screen

## Problem You are given: - A fixed set of **M distinct codes** (strings). The **column order** of the output matrix is the **lexicographic order** of these codes. - A stream of input **batches**. Each batch is a `vector<tuple<int timestamp, string code, int value>>`. Conceptually, the ideal output is an **N × M** matrix where: - Each **row** corresponds to a timestamp (rows ordered by increasing timestamp). - Each **column** corresponds to a code (columns ordered lexicographically by code). - If a `(timestamp, code)` pair has no record, its cell value is `-1`. ### Input stream guarantee (base version) Across the entire stream (or equivalently within each batch, with batches emitted in order), records are produced in **non-decreasing timestamp order**, and for the **same timestamp**, in **lexicographic code order**. ### Task Implement a **stream transformer** that consumes the above input stream and produces an output stream of: - `pair<int timestamp, vector<int> row>` where: - `timestamp` is a timestamp present in the overall input (or in a given list of timestamps, if provided). - `row` has length **M**. - `row[i]` is the value for the `i`-th code (in lexicographic order) at that `timestamp`, or `-1` if missing. Your transformer should emit exactly one output row per timestamp (once all records for that timestamp have been consumed). ### Follow-ups 1. **Some batches have codes out of order** (i.e., within a fixed timestamp, the codes are not guaranteed to be lexicographically sorted). How would you adapt the transformer? 2. **A small fraction of records have timestamps out of order** (i.e., the global timestamp ordering is slightly violated). How would you adapt the transformer while keeping memory bounded and latency reasonable? ### Clarifications / assumptions you may choose (state them clearly) - Whether the list of all M codes is given up front, or must be discovered from the stream. - Whether multiple records for the same `(timestamp, code)` can appear (and if so, whether to take last-write-wins, sum, or treat as invalid). ## Output example (illustrative) If codes are `["A","B","C"]` and the stream contains: - (1,"A",10), (1,"C",30) - (2,"B",20) Then output rows: - (1, [10, -1, 30]) - (2, [-1, 20, -1])

Quick Answer: This question evaluates proficiency in streaming data transformation and time-series alignment, including handling sparse-to-dense mapping, ordering guarantees, and memory/latency trade-offs.

Part 1: Base transformer for ordered timestamps and ordered codes

You are given a fixed set of distinct code strings and a stream of input batches. Each record is a tuple (timestamp, code, value). Across the whole stream, timestamps are in non-decreasing order, and for the same timestamp the codes appear in lexicographic order. Build dense rows in lexicographic code-column order and return exactly one row for each timestamp that appears in the stream. Missing cells must be filled with -1. Assume the code list is given up front and each (timestamp, code) appears at most once.

Constraints

  • 0 <= len(codes) <= 10^4 and all codes are distinct.
  • 0 <= total number of records across all batches <= 2 * 10^5.
  • Every record code belongs to codes.
  • Timestamps are globally non-decreasing, and codes are lexicographically sorted within each timestamp.
  • Each (timestamp, code) appears at most once.

Examples

Input: (['C', 'A', 'B'], [[(1, 'A', 10), (1, 'C', 30)], [(2, 'B', 20)]])

Expected Output: [(1, [10, -1, 30]), (2, [-1, 20, -1])]

Explanation: The input code list is unsorted, but output columns must follow sorted order ['A', 'B', 'C'].

Input: (['A', 'B'], [[(1, 'A', 5)], [(1, 'B', 7), (2, 'A', 9)]])

Expected Output: [(1, [5, 7]), (2, [9, -1])]

Explanation: A timestamp can continue in the next batch; do not emit a row just because a batch ends.

Input: (['A', 'B'], [[], []])

Expected Output: []

Explanation: Edge case: no records at all, so no timestamps are emitted.

Input: (['Z'], [[(42, 'Z', 100)]])

Expected Output: [(42, [100])]

Explanation: Edge case: one code and one record.

Solution

def solution(codes, batches):
    sorted_codes = sorted(codes)
    m = len(sorted_codes)
    result = []
    current_ts = None
    current_row = None
    code_ptr = 0

    for batch in batches:
        for ts, code, value in batch:
            if current_ts is None or ts != current_ts:
                if current_ts is not None:
                    while code_ptr < m:
                        current_row.append(-1)
                        code_ptr += 1
                    result.append((current_ts, current_row))
                current_ts = ts
                current_row = []
                code_ptr = 0

            while code_ptr < m and sorted_codes[code_ptr] < code:
                current_row.append(-1)
                code_ptr += 1

            if code_ptr < m and sorted_codes[code_ptr] == code:
                current_row.append(value)
                code_ptr += 1
            else:
                raise ValueError('Record code not present in codes list or input order invalid')

    if current_ts is not None:
        while code_ptr < m:
            current_row.append(-1)
            code_ptr += 1
        result.append((current_ts, current_row))

    return result

Time complexity: O(M log M + T * M + R), where M is the number of codes, T is the number of emitted timestamps, and R is the number of input records.. Space complexity: O(M) auxiliary space, excluding the returned output..

Hints

  1. You only need to keep one row in memory at a time: the current timestamp's row.
  2. Because codes for the same timestamp are already sorted, you can walk through the sorted code list with a pointer and fill missing positions with -1.

Part 2: Dense rows when codes inside a timestamp are unsorted

You are given a fixed set of distinct code strings and a stream of input batches. Each record is a tuple (timestamp, code, value). Timestamps are still globally non-decreasing, so all records for the same timestamp remain contiguous in the stream, but the codes inside one timestamp may arrive in any order. Build dense rows in lexicographic code-column order and return one row per timestamp. Missing cells are -1. If the same (timestamp, code) appears multiple times, use last-write-wins.

Constraints

  • 0 <= len(codes) <= 10^4 and all codes are distinct.
  • 0 <= total number of records across all batches <= 2 * 10^5.
  • Every record code belongs to codes.
  • Timestamps are globally non-decreasing, so records for one timestamp stay contiguous.
  • Codes within a timestamp may appear in any order; repeated (timestamp, code) pairs use last-write-wins.

Examples

Input: (['C', 'A', 'B'], [[(1, 'C', 30), (1, 'A', 10), (1, 'B', 20)], [(2, 'B', 5), (2, 'B', 8)]])

Expected Output: [(1, [10, 20, 30]), (2, [-1, 8, -1])]

Explanation: Codes arrive out of order for timestamp 1, and timestamp 2 shows last-write-wins for duplicate code B.

Input: (['B', 'A'], [[(4, 'B', 2)], [(4, 'A', 1), (5, 'B', 3)]])

Expected Output: [(4, [1, 2]), (5, [-1, 3])]

Explanation: A single timestamp can span multiple batches even when code order is arbitrary.

Input: (['X', 'Y'], [[(7, 'Y', 11)]])

Expected Output: [(7, [-1, 11])]

Explanation: Edge case: a single timestamp with one missing code.

Input: (['A'], [])

Expected Output: []

Explanation: Edge case: no batches.

Solution

def solution(codes, batches):
    sorted_codes = sorted(codes)
    index = {code: i for i, code in enumerate(sorted_codes)}
    m = len(sorted_codes)
    result = []
    current_ts = None
    current_row = None

    for batch in batches:
        for ts, code, value in batch:
            if current_ts is None or ts != current_ts:
                if current_ts is not None:
                    result.append((current_ts, current_row))
                current_ts = ts
                current_row = [-1] * m

            current_row[index[code]] = value

    if current_ts is not None:
        result.append((current_ts, current_row))

    return result

Time complexity: O(M log M + T * M + R), where M is the number of codes, T is the number of emitted timestamps, and R is the number of input records.. Space complexity: O(M) auxiliary space, excluding the returned output..

Hints

  1. The timestamp order guarantee still lets you keep only one active row at a time.
  2. Precompute a dictionary from code to column index so you can write directly into the correct cell even when codes arrive out of order.

Part 3: Bounded-disorder transformer for slightly out-of-order timestamps

You are given a fixed set of distinct code strings and a stream of input batches. Each record is a tuple (timestamp, code, value). Timestamps may arrive slightly out of order, but the disorder is bounded by a nonnegative integer max_lag: when a record arrives, its timestamp is always at least max_seen_so_far - max_lag. Codes inside the same timestamp may appear in any order, and repeated (timestamp, code) pairs use last-write-wins. For this problem, emission checkpoints are batch boundaries: after finishing a batch, emit every timestamp t that is guaranteed final, meaning t <= max_seen - max_lag. Also emit all remaining timestamps in order at end-of-stream. Return what is emitted after each batch, plus one extra final flush list.

Constraints

  • 0 <= len(codes) <= 10^4 and all codes are distinct.
  • 0 <= total number of records across all batches <= 2 * 10^5.
  • 0 <= max_lag.
  • Every record code belongs to codes. Codes within a timestamp may be in any order, and repeated (timestamp, code) pairs use last-write-wins.
  • Bounded lateness guarantee: each arriving record satisfies timestamp >= max_seen_so_far - max_lag.

Examples

Input: (['A', 'B', 'C'], [[(3, 'A', 30)], [(1, 'B', 10), (4, 'C', 40)], [(2, 'A', 20), (5, 'B', 50)]], 2)

Expected Output: [[], [(1, [-1, 10, -1])], [(2, [20, -1, -1]), (3, [30, -1, -1])], [(4, [-1, -1, 40]), (5, [-1, 50, -1])]]

Explanation: After batch 2, timestamp 1 becomes safe. After batch 3, timestamps 2 and 3 become safe. The last list is the end-of-stream flush.

Input: (['A', 'B'], [[(1, 'A', 7)], [(2, 'B', 9)]], 0)

Expected Output: [[(1, [7, -1])], [(2, [-1, 9])], []]

Explanation: Edge case: with max_lag = 0, rows become safe immediately after their batch if the stream is ordered.

Input: (['A', 'B'], [[(2, 'A', 5)], [(1, 'B', 7), (2, 'B', 8)], [(2, 'A', 6), (3, 'A', 9)]], 1)

Expected Output: [[], [(1, [-1, 7])], [(2, [6, 8])], [(3, [9, -1])]]

Explanation: Timestamp 2 stays active long enough to receive late updates, and last-write-wins changes A from 5 to 6 before emission.

Input: (['A'], [], 3)

Expected Output: [[]]

Explanation: Edge case: no batches means only the final flush list exists, and it is empty.

Solution

def solution(codes, batches, max_lag):
    from heapq import heappush, heappop

    sorted_codes = sorted(codes)
    index = {code: i for i, code in enumerate(sorted_codes)}
    m = len(sorted_codes)

    active = {}
    heap = []
    emitted_per_batch = []
    max_seen = None

    def flush_safe(cutoff):
        out = []
        while heap and heap[0] <= cutoff:
            ts = heappop(heap)
            out.append((ts, active.pop(ts)))
        return out

    for batch in batches:
        for ts, code, value in batch:
            if max_seen is None or ts > max_seen:
                max_seen = ts

            if ts not in active:
                active[ts] = [-1] * m
                heappush(heap, ts)

            active[ts][index[code]] = value

        cutoff = float('-inf') if max_seen is None else max_seen - max_lag
        emitted_per_batch.append(flush_safe(cutoff))

    final_flush = []
    while heap:
        ts = heappop(heap)
        final_flush.append((ts, active.pop(ts)))
    emitted_per_batch.append(final_flush)

    return emitted_per_batch

Time complexity: O(M log M + R + U log U + T * M), where M is the number of codes, R is the number of input records, U is the number of distinct timestamps that ever become active, and T is the number of emitted timestamps.. Space complexity: O(A * M + A + M), where A is the maximum number of active timestamps kept before they become safe to emit..

Hints

  1. Track the largest timestamp seen so far; then max_seen - max_lag acts like a watermark for what is safe to emit.
  2. Use a dictionary for active rows and a min-heap of active timestamps so you can flush the smallest safe timestamps in order.
Last updated: Apr 19, 2026

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Optimize trade PnL table updates - Jane Street (hard)
  • Sort trade executions into a canonical order - Jane Street (easy)
  • Solve queue switch and coin change puzzles - Jane Street (medium)
  • Implement Connect Four with bottom push-up - Jane Street (Medium)