Implement binary encode/decode with frequency tree
Company: Two Sigma
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: hard
Interview Round: Technical Screen
## Problem
Implement a **binary encoding** and **decoding** scheme for strings based on character frequencies (Huffman-like coding).
You must implement two functions:
1. `encode(s)` → returns `(bitstring, tree)`
2. `decode(bitstring, tree)` → returns the original string
### Encoding rules
Given an input string `s`:
1. Count the frequency of each character.
2. Create a leaf node for each distinct character with weight = its frequency.
3. Build a binary tree by repeatedly combining the **two nodes with the smallest weights** into a new parent node:
- Parent weight = sum of the two child weights.
- The two chosen nodes become the parent’s left and right children.
4. Assign edge labels:
- left edge = `0`
- right edge = `1`
5. The **binary representation** (code) of a character is the sequence of edge labels along the path from the root to that character’s leaf.
6. The encoded output `bitstring` is the concatenation of the codes for each character in `s`.
### Determinism / tie-breaking
To make the encoding deterministic when multiple nodes have equal weight, use this tie-break rule:
- When choosing the two minimum-weight nodes, sort candidates by `(weight, key)`.
- For a **leaf**, `key = character` (lexicographic order).
- For an **internal node**, `key = the smallest character contained in its subtree`.
- When attaching the two chosen nodes as children, the smaller `(weight, key)` becomes the **left** child (gets edge `0`).
### Decoding rules
Given `(bitstring, tree)`, decode by traversing the tree from the root:
- Read bits left-to-right.
- Bit `0` means go to the left child; bit `1` means go to the right child.
- When you reach a leaf, output its character and reset to the root.
### Inputs / outputs
- Input string `s` consists of arbitrary ASCII characters.
- Return the encoded bitstring as a string of `'0'` and `'1'`.
### Edge cases
- If `s` is empty, encoding returns `("", null)` and decoding returns `""`.
- If `s` contains only one distinct character (e.g., `"aaaa"`), assign that character the code `"0"` (so encoding is `"0000"`).
### Constraints
- `0 ≤ |s| ≤ 2 * 10^5`
- Number of distinct characters `σ ≤ 256`
- Aim for `O(|s| + σ log σ)` time and `O(σ)` extra space (excluding output).
Quick Answer: This question evaluates understanding of frequency-based binary encoding and decoding, including deterministic tree construction, tie-breaking rules, and traversal, and tests knowledge of trees, priority queues, greedy algorithms, and bit-level string manipulation in the Coding & Algorithms domain.
Implement a deterministic Huffman-like binary encoding and decoding scheme for strings. For this platform, implement solution(s), which should build the frequency tree, encode s into a bitstring, decode that bitstring using the same tree, and return [bitstring, decoded_string]. To build the tree, count character frequencies, create one leaf per distinct character, then repeatedly combine the two nodes with the smallest (weight, key), where a leaf key is its character and an internal-node key is the smallest character in its subtree. The smaller chosen node becomes the left child and receives edge 0; the other becomes the right child and receives edge 1. A character's code is the path from root to its leaf. If s is empty, the encoded bitstring is empty and the decoded string is empty. If s has only one distinct character, that character's code is 0.
Constraints
- 0 <= len(s) <= 2 * 10^5
- Number of distinct characters sigma <= 256
- Characters are arbitrary ASCII characters
- Tie-breaking must use (weight, key), where key is the smallest character in the node's subtree
- Aim for O(|s| + sigma log sigma) time excluding output size
Examples
Input: ("",)
Expected Output: ["", ""]
Explanation: The empty string produces an empty encoded bitstring and decodes back to the empty string.
Input: ("aaaa",)
Expected Output: ["0000", "aaaa"]
Explanation: There is only one distinct character, so 'a' is assigned code '0'. Four occurrences encode to '0000'.
Input: ("ab",)
Expected Output: ["01", "ab"]
Explanation: 'a' and 'b' have equal frequency. Lexicographic tie-breaking puts 'a' on the left with code 0 and 'b' on the right with code 1.
Input: ("abbccc",)
Expected Output: ["000101111", "abbccc"]
Explanation: Frequencies are a:1, b:2, c:3. Combine a and b first; then combine that internal node with c. Codes are a=00, b=01, c=1.
Input: ("abcabc",)
Expected Output: ["1011010110", "abcabc"]
Explanation: All characters have frequency 2. Combine a and b first into an internal node of weight 4; c has weight 2 and becomes the left child of the root. Codes are c=0, a=10, b=11.
Input: ("A!A!",)
Expected Output: ["1010", "A!A!"]
Explanation: '!' and 'A' both have frequency 2. ASCII lexicographic order puts '!' before 'A', so !=0 and A=1.
Input: ("aaabbc",)
Expected Output: ["000111110", "aaabbc"]
Explanation: Frequencies are a:3, b:2, c:1. First combine c and b into an internal node with key b, then tie-break against a by key. Codes are a=0, c=10, b=11.
Hints
- Use a min-heap ordered by (weight, key) to repeatedly choose the two nodes to combine.
- After the tree is built, run a DFS to create a character-to-code map. Remember to special-case the empty string and the single-distinct-character tree.