Algorithm Prompt: String Compression and Decompression with Run-Length Encoding
Design algorithms to compress and decompress a string using run-length encoding (RLE). For example, if the input is "aaabbc", a simple encoding can produce "a3b2c1".
Constraints & Assumptions
-
For the simple interview version, assume input characters are lowercase English letters, so
character + decimal count
is unambiguous.
-
Always include the count, even when it is 1.
-
Counts are positive base-10 integers.
-
If the input may contain arbitrary characters including digits, the simple
character + count
format is ambiguous; propose a delimiter, escaping, or length-prefixed format.
-
Very long inputs and runs should be handled without integer overflow in languages with fixed-size integers.
Clarifying Questions to Ask
-
What character set can the input contain?
-
Is the output format fixed as
character + count
, or can I choose a safer format?
-
Should compression return the encoded string even if it is longer than the original?
-
Do we need streaming encode/decode for very large inputs?
-
How should invalid compressed input be handled?
Part 1 - Compression
Define the compression algorithm.
What This Part Should Cover
-
Scan the input once while tracking the current character and run length.
-
Flush a run when the character changes.
-
Handle empty input and single-character runs.
-
Use an output buffer rather than repeated string concatenation.
-
Discuss whether to return the original string if compression does not help.
Part 2 - Decompression
Define the decompression algorithm.
What This Part Should Cover
-
Parse one character and then one or more digits as the count for the simple lowercase-letter format.
-
Validate malformed inputs, missing counts, and zero or negative counts.
-
Expand the character count times.
-
Mention how decoding changes if arbitrary characters require delimiters or escaping.
Part 3 - Complexity and Edge Cases
Discuss edge cases and Big-O complexity.
What This Part Should Cover
-
Empty string, single-character strings, all unique characters, all same characters, and very long runs.
-
Cases where encoded output is longer.
-
Time complexity for encode and decode.
-
Space complexity in terms of output size.
-
Ambiguity of digit-containing input under the naive format.
What a Strong Answer Covers
A strong answer gives correct encode/decode procedures, names the ambiguity in naive RLE when input can contain digits, chooses or clarifies a safe format, and analyzes complexity in terms of input and output sizes.
Follow-up Questions
-
How would you support streaming compression?
-
How would you design an unambiguous format for arbitrary Unicode input?
-
What if the count is extremely large during decoding?
-
When should a compressor return the original string?
-
How would you validate compressed input safely?