PracHub
QuestionsPremiumLearningGuidesInterview PrepNEWCoaches

Quick Overview

This question evaluates understanding of data compression algorithms, bit-level manipulation, two's‑complement integer representation, streaming algorithm design, and encoder/decoder state management for 32-bit signed integers.

  • hard
  • Databricks
  • Coding & Algorithms
  • Software Engineer

Implement streaming RLE and bit-packed codec

Company: Databricks

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: hard

Interview Round: Onsite

You are implementing a simple compression scheme for sequences of 32‑bit signed integers. The codec should support **two encoding strategies**: 1. **Run‑Length Encoding (RLE)** for long runs of equal values. 2. **Bit‑Packed Encoding (BP)** for blocks of values that can be represented with a small, uniform bit‑width. The codec is **streaming**: values arrive one by one, and the encoder should buffer them into blocks and choose an encoding strategy per block. ### Compressed representation Design a compressed representation made of **blocks**. For concreteness, define: - **RLE block** - Represents a run of a single repeated integer value. - Fields: - `type = 'R'` - `value` (int32) - `count` (int, number of repetitions) - Only use RLE blocks when `count >= RLE_MIN_RUN` (e.g., `RLE_MIN_RUN = 3`). Shorter runs should not be encoded as RLE. - **Bit‑packed block** - Represents a sequence of `count` possibly different integers, all of which fit into `bitWidth` bits in two's‑complement representation. - Fields: - `type = 'B'` - `bitWidth` (1–32) - `count` (number of values) - A packed payload containing exactly `count` values, each stored using `bitWidth` bits. - You may choose a **maximum block size** `MAX_BP_BLOCK` (e.g., 128 values) for simplicity. You can decide the in‑memory representation of a bit‑packed payload (e.g., array of 32‑bit integers where bits are tightly packed), as long as the decoder can reconstruct the original sequence exactly. ### Encoder Implement an `Encoder` class with the following behavior: - It receives input values **one at a time** via a method like: ```text void add(int value) ``` - Internally, the encoder may maintain a buffer of recent values to decide whether to form an RLE block or a bit‑packed block. - At appropriate times (e.g., when a block is full, when the encoding strategy should change, or when `flush()` is called), it should **emit blocks**. - Provide a method: ```text List<Block> flush() ``` that finalizes the stream, closes any open block, and returns the list of compressed blocks. **Encoding strategy constraints:** - For a maximal run of the same value with length `L`: - If `L >= RLE_MIN_RUN`, you should encode it as a single RLE block. - Otherwise, those values should be part of a bit‑packed block. - For bit‑packed blocks: - You may group consecutive non‑RLE values into blocks up to size `MAX_BP_BLOCK`. - Choose `bitWidth` for a block as the minimum number of bits needed to represent **all** its values. - You do **not** need to prove global optimality of the compression, but your encoder must consistently follow the above rules. The encoder must correctly handle: - Negative values and the full range of 32‑bit signed integers (`Integer.MIN_VALUE` to `Integer.MAX_VALUE`). - Transitions between RLE and bit‑packed segments in the stream. ### Decoder Implement a corresponding `Decoder` that reconstructs the original integer sequence from a list of blocks. - Its constructor receives the list of blocks produced by the encoder: ```text class Decoder implements Iterator<Integer> { Decoder(List<Block> blocks) { ... } boolean hasNext(); int next(); } ``` - `hasNext()` / `next()` should expose the **original sequence of integers** in order, exactly as they were passed to `Encoder.add(...)`. - The decoder must correctly iterate across both RLE and bit‑packed blocks. ### Tasks 1. Specify the exact in‑memory structure you will use for `Block` and the bit‑packed payload. 2. Implement `Encoder.add(value)` and `Encoder.flush()` to produce a valid block sequence. 3. Implement `Decoder` as an iterator over the decompressed values. 4. Write unit tests for cases including: - Simple sequences without repeats (forces bit‑packing). - Long runs of a single value (forces RLE). - Alternating patterns (switching between RLE and BP). - Values near `Integer.MAX_VALUE`, `Integer.MIN_VALUE`, and negative numbers. - Empty input sequence. Assume you are working in an object‑oriented language (e.g., Java, C++, or similar) and focus on clean class design and correctness of the codec pair.

Quick Answer: This question evaluates understanding of data compression algorithms, bit-level manipulation, two's‑complement integer representation, streaming algorithm design, and encoder/decoder state management for 32-bit signed integers.

Part 1: Build a canonical block representation

Implement a function that creates one compression block in a canonical Python in-memory format. Use these exact dictionary shapes: - RLE block: {'type': 'R', 'value': value, 'count': count} - Bit-packed block: {'type': 'B', 'bitWidth': w, 'count': n, 'words': words} For a bit-packed block: 1. Compute the minimum signed two's-complement bit width w that can represent every input value. 2. Pack the values tightly into unsigned 32-bit words. 3. Packing order is canonical: the first value occupies the lowest w bits of words[0], the second starts immediately after it, and so on. 4. If a value crosses a 32-bit boundary, continue its remaining high bits in the next word. For an empty bit-packed input, return {'type': 'B', 'bitWidth': 0, 'count': 0, 'words': []}.

Constraints

  • block_type is either 'R' or 'B'.
  • All integers are in the signed 32-bit range [-2147483648, 2147483647].
  • For an RLE block, count >= 1.
  • For a bit-packed block, 0 <= len(data) <= 10^4.

Examples

Input: ('R', (5, 4))

Expected Output: {'type': 'R', 'value': 5, 'count': 4}

Explanation: A run-length block is stored directly.

Input: ('B', [0, 1, -1])

Expected Output: {'type': 'B', 'bitWidth': 2, 'count': 3, 'words': [52]}

Explanation: The minimum common width is 2 bits: 0 -> 00, 1 -> 01, -1 -> 11. Packed word is 0b110100 = 52.

Input: ('B', [-2147483648, 2147483647])

Expected Output: {'type': 'B', 'bitWidth': 32, 'count': 2, 'words': [2147483648, 2147483647]}

Explanation: The full 32-bit signed range requires width 32, so each value occupies one full word.

Input: ('B', [])

Expected Output: {'type': 'B', 'bitWidth': 0, 'count': 0, 'words': []}

Explanation: Edge case: empty bit-packed input.

Input: ('B', [7, 7, 7, 7, 7, 7, 7, 7, 7])

Expected Output: {'type': 'B', 'bitWidth': 4, 'count': 9, 'words': [2004318071, 7]}

Explanation: Nine 4-bit values need 36 bits, so the payload spills into a second word.

Solution

def solution(block_type, data):
    def min_bits_signed(x):
        if x >= 0:
            return 1 if x == 0 else x.bit_length() + 1
        return (~x).bit_length() + 1

    def pack(values, w):
        words = []
        current = 0
        used = 0
        mask = (1 << w) - 1
        for v in values:
            u = v & mask
            remaining = w
            while remaining > 0:
                space = 32 - used
                take = remaining if remaining < space else space
                part = u & ((1 << take) - 1)
                current |= part << used
                used += take
                u >>= take
                remaining -= take
                if used == 32:
                    words.append(current & 0xFFFFFFFF)
                    current = 0
                    used = 0
        if used > 0:
            words.append(current & 0xFFFFFFFF)
        return words

    if block_type == 'R':
        value, count = data
        return {'type': 'R', 'value': value, 'count': count}

    values = list(data)
    if not values:
        return {'type': 'B', 'bitWidth': 0, 'count': 0, 'words': []}

    bit_width = max(min_bits_signed(v) for v in values)
    return {'type': 'B', 'bitWidth': bit_width, 'count': len(values), 'words': pack(values, bit_width)}

Time complexity: O(n + number_of_words), which is O(n) because each value uses at most 32 bits. Space complexity: O(number_of_words).

Hints

  1. A signed w-bit two's-complement integer must lie in the range [-2^(w-1), 2^(w-1)-1].
  2. While packing, track the current 32-bit word and the bit offset already used inside it.

Part 2: Encode an integer stream into RLE and bit-packed blocks

Implement the encoder for a simple integer codec. Use the following canonical block formats: - RLE block: {'type': 'R', 'value': value, 'count': count} - Bit-packed block: {'type': 'B', 'bitWidth': w, 'count': n, 'words': words} Encoding rules: 1. Scan the input sequence in order. 2. For every maximal run of equal values with length L: - if L >= rle_min_run, emit exactly one RLE block for that run - otherwise, those values must become part of bit-packed blocks 3. Consecutive non-RLE values are emitted as bit-packed blocks of size at most max_bp_block. 4. Each bit-packed block uses the minimum signed two's-complement width needed by all values in that block. 5. Values are packed tightly into unsigned 32-bit words. The first value uses the lowest w bits of words[0], the next value follows immediately, and so on. Return the full encoded block list.

Constraints

  • 0 <= len(values) <= 10^5
  • 1 <= rle_min_run <= 10
  • 1 <= max_bp_block <= 128
  • All input values are in the signed 32-bit range [-2147483648, 2147483647].

Examples

Input: ([], 3, 128)

Expected Output: []

Explanation: Edge case: empty input produces no blocks.

Input: ([1, 2, 3], 3, 128)

Expected Output: [{'type': 'B', 'bitWidth': 3, 'count': 3, 'words': [209]}]

Explanation: No run is long enough for RLE, so the whole sequence becomes one BP block.

Input: ([7, 7, 7, 7], 3, 128)

Expected Output: [{'type': 'R', 'value': 7, 'count': 4}]

Explanation: A maximal run of length 4 qualifies for RLE.

Input: ([5, 5, 5, 1, 2, 2, 3, 3, 3, 4], 3, 128)

Expected Output: [{'type': 'R', 'value': 5, 'count': 3}, {'type': 'B', 'bitWidth': 3, 'count': 3, 'words': [145]}, {'type': 'R', 'value': 3, 'count': 3}, {'type': 'B', 'bitWidth': 4, 'count': 1, 'words': [4]}]

Explanation: The 5s and 3s become RLE blocks; the short middle segment and final 4 become BP blocks.

Input: ([0, 0, 1, 1, 2, 2], 3, 4)

Expected Output: [{'type': 'B', 'bitWidth': 2, 'count': 4, 'words': [80]}, {'type': 'B', 'bitWidth': 3, 'count': 2, 'words': [18]}]

Explanation: No run reaches length 3, so everything is BP. max_bp_block = 4 forces the data to split into two BP blocks.

Solution

def solution(values, rle_min_run=3, max_bp_block=128):
    from collections import deque

    def min_bits_signed(x):
        if x >= 0:
            return 1 if x == 0 else x.bit_length() + 1
        return (~x).bit_length() + 1

    def pack(vals, w):
        words = []
        current = 0
        used = 0
        mask = (1 << w) - 1
        for v in vals:
            u = v & mask
            remaining = w
            while remaining > 0:
                space = 32 - used
                take = remaining if remaining < space else space
                part = u & ((1 << take) - 1)
                current |= part << used
                used += take
                u >>= take
                remaining -= take
                if used == 32:
                    words.append(current & 0xFFFFFFFF)
                    current = 0
                    used = 0
        if used > 0:
            words.append(current & 0xFFFFFFFF)
        return words

    def make_bp(chunk):
        bit_width = max(min_bits_signed(v) for v in chunk)
        return {'type': 'B', 'bitWidth': bit_width, 'count': len(chunk), 'words': pack(chunk, bit_width)}

    blocks = []
    buffer = deque()

    def flush_full_chunks():
        while len(buffer) >= max_bp_block:
            chunk = [buffer.popleft() for _ in range(max_bp_block)]
            blocks.append(make_bp(chunk))

    def flush_all():
        while buffer:
            size = min(len(buffer), max_bp_block)
            chunk = [buffer.popleft() for _ in range(size)]
            blocks.append(make_bp(chunk))

    i = 0
    n = len(values)
    while i < n:
        j = i + 1
        while j < n and values[j] == values[i]:
            j += 1
        run_len = j - i

        if run_len >= rle_min_run:
            flush_all()
            blocks.append({'type': 'R', 'value': values[i], 'count': run_len})
        else:
            for k in range(i, j):
                buffer.append(values[k])
            flush_full_chunks()

        i = j

    flush_all()
    return blocks

Time complexity: O(n). Space complexity: O(max_bp_block + output_size).

Hints

  1. Think in terms of maximal runs first. A long enough run becomes one R block; a short run is buffered into BP data.
  2. Flush the buffered non-RLE values before emitting an R block, and also whenever the BP buffer reaches max_bp_block.

Part 3: Decode RLE and bit-packed blocks

You are given a list of compressed blocks and must reconstruct the original integer sequence. Block formats: - RLE block: {'type': 'R', 'value': value, 'count': count} - Bit-packed block: {'type': 'B', 'bitWidth': w, 'count': n, 'words': words} For a bit-packed block: - values are stored tightly in unsigned 32-bit words - the first value uses the lowest w bits of words[0] - the next value starts immediately after it - if a value crosses a word boundary, continue reading from the next word - each extracted w-bit pattern is a signed two's-complement integer and must be sign-extended back to a normal Python int Return the fully decompressed list.

Constraints

  • 0 <= len(blocks) <= 10^5
  • Each block is valid and follows the canonical format.
  • For bit-packed blocks, 1 <= bitWidth <= 32 when count > 0.
  • All decoded values fit in the signed 32-bit range.

Examples

Input: []

Expected Output: []

Explanation: Edge case: no blocks means no output values.

Input: [{'type': 'R', 'value': 5, 'count': 4}]

Expected Output: [5, 5, 5, 5]

Explanation: An RLE block expands into repeated values.

Input: [{'type': 'B', 'bitWidth': 3, 'count': 3, 'words': [209]}]

Expected Output: [1, 2, 3]

Explanation: The 3-bit values are packed as 001, 010, 011.

Input: [{'type': 'B', 'bitWidth': 32, 'count': 2, 'words': [2147483647, 2147483648]}, {'type': 'R', 'value': -1, 'count': 2}]

Expected Output: [2147483647, -2147483648, -1, -1]

Explanation: This checks both 32-bit signed decoding and iteration across different block types.

Input: [{'type': 'B', 'bitWidth': 5, 'count': 7, 'words': [4294967295, 7]}]

Expected Output: [-1, -1, -1, -1, -1, -1, -1]

Explanation: Seven 5-bit values cross the 32-bit boundary; every 5-bit pattern is 11111, which is -1.

Solution

def solution(blocks):
    result = []

    for block in blocks:
        if block['type'] == 'R':
            result.extend([block['value']] * block['count'])
            continue

        w = block['bitWidth']
        count = block['count']
        words = block['words']
        if count == 0:
            continue

        mask = (1 << w) - 1
        bit_pos = 0
        for _ in range(count):
            word_index = bit_pos // 32
            offset = bit_pos % 32

            if offset + w <= 32:
                u = (words[word_index] >> offset) & mask
            else:
                low_bits = 32 - offset
                high_bits = w - low_bits
                low = (words[word_index] >> offset) & ((1 << low_bits) - 1)
                high = words[word_index + 1] & ((1 << high_bits) - 1)
                u = low | (high << low_bits)

            sign_bit = 1 << (w - 1)
            if u & sign_bit:
                value = u - (1 << w)
            else:
                value = u
            result.append(value)
            bit_pos += w

    return result

Time complexity: O(total_decoded_values). Space complexity: O(total_decoded_values).

Hints

  1. The i-th packed value starts at bit position i * bitWidth inside that block's bitstream.
  2. After extracting the unsigned w-bit pattern, check its highest bit to decide whether it represents a negative number.

Part 4: Validate codec unit-test coverage

You are reviewing a proposed unit-test suite for the integer codec. Each test is just an input sequence of signed 32-bit integers. Return a summary dictionary with these boolean keys: - 'simple_bp': at least one non-empty sequence whose canonical encoding would use only bit-packed blocks - 'long_run_rle': at least one sequence containing a maximal run of length >= rle_min_run - 'alternating_switch': at least one sequence whose canonical encoding would contain both an RLE block and a bit-packed block - 'has_int_min': at least one occurrence of -2147483648 anywhere in the suite - 'has_int_max': at least one occurrence of 2147483647 anywhere in the suite - 'has_negative': at least one negative value anywhere in the suite - 'empty_input': at least one empty sequence - 'complete': true only if all the previous checks are true A sequence uses only bit-packed blocks if none of its maximal equal-value runs reaches rle_min_run.

Constraints

  • 0 <= number of sequences <= 10^4
  • 0 <= total number of integers across all sequences <= 2 * 10^5
  • 1 <= rle_min_run <= 10
  • All integers are in the signed 32-bit range [-2147483648, 2147483647].

Examples

Input: ([[], [1, 2, 3], [7, 7, 7, 7], [5, 5, 5, 1, 2, 2, 3, 3, 3], [2147483647, -2147483648, -1]], 3)

Expected Output: {'simple_bp': True, 'long_run_rle': True, 'alternating_switch': True, 'has_int_min': True, 'has_int_max': True, 'has_negative': True, 'empty_input': True, 'complete': True}

Explanation: This suite covers every required category.

Input: ([[1, 2], [3, 3, 3]], 3)

Expected Output: {'simple_bp': True, 'long_run_rle': True, 'alternating_switch': False, 'has_int_min': False, 'has_int_max': False, 'has_negative': False, 'empty_input': False, 'complete': False}

Explanation: The suite has BP-only and RLE-only cases, but no empty case, no boundary values, and no single sequence that switches strategies.

Input: ([[]], 3)

Expected Output: {'simple_bp': False, 'long_run_rle': False, 'alternating_switch': False, 'has_int_min': False, 'has_int_max': False, 'has_negative': False, 'empty_input': True, 'complete': False}

Explanation: Edge case: only the empty-input requirement is satisfied.

Input: ([[-1, -2], [4, 4, 4, 5, 6, 6, 6], [2147483647]], 3)

Expected Output: {'simple_bp': True, 'long_run_rle': True, 'alternating_switch': True, 'has_int_min': False, 'has_int_max': True, 'has_negative': True, 'empty_input': False, 'complete': False}

Explanation: This suite is close, but it still misses INT_MIN and an empty test.

Input: ([[], [1, 2], [3, 3, 3], [2147483647, -2147483648, -1]], 3)

Expected Output: {'simple_bp': True, 'long_run_rle': True, 'alternating_switch': False, 'has_int_min': True, 'has_int_max': True, 'has_negative': True, 'empty_input': True, 'complete': False}

Explanation: Even with all other categories present, alternating_switch is still missing because no single sequence uses both block types.

Solution

def solution(test_sequences, rle_min_run=3):
    INT_MIN = -2147483648
    INT_MAX = 2147483647

    def analyze_sequence(seq):
        has_r = False
        has_b = False
        i = 0
        n = len(seq)
        while i < n:
            j = i + 1
            while j < n and seq[j] == seq[i]:
                j += 1
            if j - i >= rle_min_run:
                has_r = True
            else:
                has_b = True
            i = j
        return has_r, has_b

    summary = {
        'simple_bp': False,
        'long_run_rle': False,
        'alternating_switch': False,
        'has_int_min': False,
        'has_int_max': False,
        'has_negative': False,
        'empty_input': False,
        'complete': False,
    }

    for seq in test_sequences:
        if not seq:
            summary['empty_input'] = True
        if any(v == INT_MIN for v in seq):
            summary['has_int_min'] = True
        if any(v == INT_MAX for v in seq):
            summary['has_int_max'] = True
        if any(v < 0 for v in seq):
            summary['has_negative'] = True

        has_r, has_b = analyze_sequence(seq)
        if seq and not has_r:
            summary['simple_bp'] = True
        if has_r:
            summary['long_run_rle'] = True
        if has_r and has_b:
            summary['alternating_switch'] = True

    summary['complete'] = all([
        summary['simple_bp'],
        summary['long_run_rle'],
        summary['alternating_switch'],
        summary['has_int_min'],
        summary['has_int_max'],
        summary['has_negative'],
        summary['empty_input'],
    ])
    return summary

Time complexity: O(total number of integers across the whole suite). Space complexity: O(1) extra space, excluding the returned dictionary.

Hints

  1. A sequence has an RLE block iff at least one maximal run is long enough.
  2. alternating_switch is stronger than having some BP-only tests and some RLE-only tests; one single sequence must produce both block types.
Last updated: Apr 19, 2026

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Find the Best Commute Mode - Databricks (medium)
  • Partition a Target String by Source Substrings - Databricks (medium)
  • Design In-Memory QPS Counter - Databricks (medium)
  • Implement Random Connectivity and Grid Routing - Databricks
  • Implement a Snapshot Set Iterator - Databricks (hard)