Implement XML tokenizer and parser with operations
Company: Snapchat
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
You are given either
(a) a raw XML-like string such as <catalog><book><author>Gambardella, Matthew</author></book></catalog> or
(b) its tokenized form as a list of dictionaries like [{'text': 'catalog', 'token_type': 'open_tag'}, {'text': 'book', 'token_type': 'open_tag'}, {'text': 'author', 'token_type': 'open_tag'}, {'text': 'Gambardella, Matthew', 'token_type': 'raw_text'}, {'text': 'author', 'token_type': 'close_tag'}, {'text': 'book', 'token_type': 'close_tag'}, {'text': 'catalog', 'token_type': 'close_tag'}]. Implement:
1) tokenize(xml_str) -> list[dict] that emits tokens where token_type ∈ {open_tag, close_tag, raw_text};
2) class XMLParser with __init__(tokens: list[dict]) that validates the structure and raises an exception for malformed input;
3) to_string()/__str__() that reconstructs the original XML;
4) add_element(path: list[str], tag: str, text: str|null, index: int|null) to insert a new element under the node identified by path;
5) remove_element(path: list[str]) to delete a node;
6) traverse_iterative() that performs an iterative DFS (no recursion) and yields nodes in preorder. Constraints and requirements: use a linear scan with a stack for validation; overall validation should be O
(n) time and O
(h) space where h is tree height; tags have no attributes and text may contain any characters except '<' and '>'; handle edge cases such as mismatched or out-of-order closing tags, unclosed tags at EOF, empty token lists, and extraneous raw_text between sibling tags. Describe your data structures, algorithms, and time/space complexity for each method.
Quick Answer: This interview question evaluates algorithm design, data structures, correctness, complexity, edge cases, and implementation details in a realistic interview setting. A strong answer for Implement XML tokenizer and parser with operations states assumptions, handles edge cases, explains trade-offs, and shows how to validate the result clearly.
Solution
# Solution Alignment
The prompt asks for an implementation-level answer. The safest way to present it is to define the state, maintain clear invariants, then walk through complexity and tests.
## Problem Restatement
You are given either (a) a raw XML-like string such as <catalog><book><author>Gambardella, Matthew</author></book></catalog> or (b) its tokenized form as a list of dictionaries like [{'text': 'catalog', 'token_type': 'open_tag'}, {'text': 'book', 'token_type': 'open_tag'}, {'text': 'author', 'token_type': 'open_tag'}, {'text': 'Gambardella, Matthew', 'token_type': 'raw_text'}, {'text': 'author', 'token_type': 'close_tag'}, {'text': 'book', 'token_type': 'close_tag'}, {'text': 'catalog', 'token_type': 'close_tag'}]. Implement: 1) tokenize(xml_str) -> list[dict] that emits tokens where token_type ∈ {open_tag, close_tag, raw_text}; 2) class XMLParser with __init__(tokens: list[dict]) that valid...
## Recommended Approach
Choose traversal based on the required output. DFS is natural for subtree computations, reconstruction, and range pruning; BFS is natural for level order or side views. Keep per-depth or per-position state when the output depends on columns, rows, or depths.
## Correctness
The implementation should maintain an invariant after each loop or operation that directly matches the problem statement. At termination, that invariant implies the returned value has considered every valid candidate exactly once, or has preserved the required data-structure state after every API call.
## Complexity
Most tree traversals are O(n) time and O(h) recursion stack for DFS or O(w) queue space for BFS, where h is height and w is maximum width.
## Edge Cases and Tests
Empty tree, one node, skewed tree, duplicate values when reconstruction assumes uniqueness, deep recursion, and tie-breaking for same row/column nodes.