Design a parser for a hypothetical language
Company: OpenAI
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
Quick Answer: This question evaluates competency in compiler and language design, covering grammar analysis, parsing algorithms (LL/LR/GLR/Earley/PEG), lexer and parser implementation, AST construction, error reporting and recovery, operator precedence and associativity, and analysis of time/space trade-offs.
Part 1: Parse a Grammar Specification into an Internal Representation
Constraints
- 0 <= number of lines <= 200
- Each production contains exactly one ->
- Symbols are whitespace-separated and epsilon is represented by 'ε'
Examples
Input: ['%left + -', '%left * /', 'Expr -> Expr + Term | Expr - Term | Term', 'Term -> Term * Factor | Term / Factor | Factor', 'Factor -> num | ( Expr )']
Expected Output: {'nonterminals': ['Expr', 'Factor', 'Term'], 'terminals': ['(', ')', '*', '+', '-', '/', 'num'], 'productions': [{'lhs': 'Expr', 'rhs': ['Expr', '+', 'Term']}, {'lhs': 'Expr', 'rhs': ['Expr', '-', 'Term']}, {'lhs': 'Expr', 'rhs': ['Term']}, {'lhs': 'Term', 'rhs': ['Term', '*', 'Factor']}, {'lhs': 'Term', 'rhs': ['Term', '/', 'Factor']}, {'lhs': 'Term', 'rhs': ['Factor']}, {'lhs': 'Factor', 'rhs': ['num']}, {'lhs': 'Factor', 'rhs': ['(', 'Expr', ')']}], 'precedence': {'+': {'assoc': 'left', 'level': 1}, '-': {'assoc': 'left', 'level': 1}, '*': {'assoc': 'left', 'level': 2}, '/': {'assoc': 'left', 'level': 2}}}
Explanation: This includes both directives and production rules.
Input: []
Expected Output: {'nonterminals': [], 'terminals': [], 'productions': [], 'precedence': {}}
Explanation: Edge case: empty specification.
Input: ['S -> ε']
Expected Output: {'nonterminals': ['S'], 'terminals': [], 'productions': [{'lhs': 'S', 'rhs': ['ε']}], 'precedence': {}}
Explanation: A grammar with only epsilon has no terminals.
Input: ['%right ^', 'Atom -> id | num', 'Pow -> Atom ^ Pow | Atom']
Expected Output: {'nonterminals': ['Atom', 'Pow'], 'terminals': ['^', 'id', 'num'], 'productions': [{'lhs': 'Atom', 'rhs': ['id']}, {'lhs': 'Atom', 'rhs': ['num']}, {'lhs': 'Pow', 'rhs': ['Atom', '^', 'Pow']}, {'lhs': 'Pow', 'rhs': ['Atom']}], 'precedence': {'^': {'assoc': 'right', 'level': 1}}}
Explanation: A single precedence declaration assigns level 1.
Hints
- First collect every left-hand side before classifying terminals.
- Flatten alternatives after splitting each right-hand side on |.
Part 2: Detect Left Recursion, Unreachable Symbols, Non-Productive Symbols, and FIRST-Set Conflicts
Constraints
- 1 <= number of nonterminals <= 100
- Each production is a list of strings
- The start symbol always exists in the grammar
Examples
Input: ('S', {'S': [['S', 'a'], ['A'], ['b']], 'A': [['ε'], ['c']], 'B': [['d']]})
Expected Output: {'left_recursive': ['S'], 'unreachable': ['B'], 'non_productive': [], 'ambiguous_nonterminals': ['S']}
Explanation: S is directly left-recursive through S -> S a. B is never reached from the start symbol. All nonterminals are productive. S is ambiguous under the heuristic because FIRST(S a) overlaps with FIRST(A) and FIRST(b).
Input: ('S', {'S': [['A', 'm']], 'A': [['B']], 'B': [['S'], ['t'], ['t', 'u']], 'C': [['D']], 'D': [['C']]})
Expected Output: {'left_recursive': ['A', 'B', 'S'], 'unreachable': ['C', 'D'], 'non_productive': ['C', 'D'], 'ambiguous_nonterminals': ['B']}
Explanation: S, A, and B form a productive left-recursive cycle. C and D form a dead cycle, so they are unreachable and non-productive, but they should not be reported as left-recursive for this problem. B is ambiguous because FIRST(S), FIRST(t), and FIRST(t u) all contain 't'.
Input: ('S', {'S': [['A']], 'A': [['B']], 'B': [['C']], 'C': [['C']]})
Expected Output: {'left_recursive': [], 'unreachable': [], 'non_productive': ['A', 'B', 'C', 'S'], 'ambiguous_nonterminals': []}
Explanation: Every nonterminal is reachable from S, but none can derive any terminal string. The self-cycle at C is a dead cycle, so no nonterminal is reported as left-recursive under the productive-left-recursion rule.
Input: ('S', {'S': [['ε']]})
Expected Output: {'left_recursive': [], 'unreachable': [], 'non_productive': [], 'ambiguous_nonterminals': []}
Explanation: The grammar contains a single productive epsilon production. There is no recursion, no unreachable symbol, and only one production, so there is no FIRST-set conflict.
Hints
- Unreachable symbols come from a graph traversal starting at the start symbol.
- A productive nonterminal is one that can eventually derive only terminals and/or ε.
Part 3: Transform a Grammar by Eliminating Direct Left Recursion and Left-Factoring
Constraints
- 1 <= number of nonterminals <= 50
- Only direct left recursion must be eliminated
- Factoring is repeated only on common first symbols
Examples
Input: {'A': [['A', 'a'], ['A', 'b'], ['c'], ['d']]}
Expected Output: {'A': [['c', 'A_rec'], ['d', 'A_rec']], 'A_rec': [['a', 'A_rec'], ['b', 'A_rec'], ['ε']]}
Explanation: Direct left recursion is eliminated in the standard way.
Input: {'S': [['a', 'b', 'c'], ['a', 'b', 'd'], ['a', 'e']]}
Expected Output: {'S': [['a', 'F1']], 'F1': [['b', 'F2'], ['e']], 'F2': [['c'], ['d']]}
Explanation: Factoring happens twice: first on a, then inside F1 on b.
Input: {'S': [['S', 'a'], ['b'], ['b', 'c']]}
Expected Output: {'S': [['b', 'F1']], 'S_rec': [['a', 'S_rec'], ['ε']], 'F1': [['S_rec'], ['c', 'S_rec']]}
Explanation: First remove direct left recursion, then factor the two b-starting productions.
Input: {'S': [['a'], ['b']]}
Expected Output: {'S': [['a'], ['b']]}
Explanation: Edge case: no transformation is needed.
Hints
- Split A productions into recursive ones of the form A -> A α and non-recursive ones A -> β.
- After creating one factoring helper, restart the scan because the new grammar may contain another factorable nonterminal.
Part 4: Compute FIRST and FOLLOW Sets, Check LL(1), and Choose an Alternative Parser
Constraints
- 1 <= number of nonterminals <= 100
- Epsilon is represented by 'ε'
- All productions are lists of symbols
Examples
Input: {'start': 'E', 'grammar': {'E': [['T', 'Ep']], 'Ep': [['+', 'T', 'Ep'], ['ε']], 'T': [['id'], ['(', 'E', ')']]}}
Expected Output: {'first': {'E': ['(', 'id'], 'Ep': ['+', 'ε'], 'T': ['(', 'id']}, 'follow': {'E': ['$', ')'], 'Ep': ['$', ')'], 'T': ['$', ')', '+']}, 'is_ll1': True, 'alternative': 'LL(1)', 'reason': 'No LL(1) conflicts'}
Explanation: This is a standard LL(1) expression grammar.
Input: {'start': 'E', 'grammar': {'E': [['E', '+', 'T'], ['T']], 'T': [['id']]}}
Expected Output: {'first': {'E': ['id'], 'T': ['id']}, 'follow': {'E': ['$', '+'], 'T': ['$', '+']}, 'is_ll1': False, 'alternative': 'LR(1)', 'reason': 'Left recursion detected'}
Explanation: The grammar is not LL(1) and left recursion suggests an LR-style parser.
Input: {'start': 'S', 'grammar': {'S': [['a', 'A'], ['a', 'B']], 'A': [['c']], 'B': [['d']]}}
Expected Output: {'first': {'A': ['c'], 'B': ['d'], 'S': ['a']}, 'follow': {'A': ['$'], 'B': ['$'], 'S': ['$']}, 'is_ll1': False, 'alternative': 'PEG', 'reason': 'Productions share a prefix token'}
Explanation: There is a FIRST/FIRST conflict but no left recursion.
Input: {'start': 'S', 'grammar': {'S': [['A', 'b']], 'A': [['ε'], ['b']]}}
Expected Output: {'first': {'A': ['b', 'ε'], 'S': ['b']}, 'follow': {'A': ['b'], 'S': ['$']}, 'is_ll1': False, 'alternative': 'Earley', 'reason': 'Nullable/FOLLOW conflict requires a more general parser'}
Explanation: The conflict arises from epsilon and FOLLOW, not from a shared non-epsilon FIRST token.
Hints
- For LL(1), check both FIRST/FIRST conflicts and epsilon-related FIRST/FOLLOW conflicts.
- FOLLOW of the start symbol always contains $.
Part 5: Build an AST with a Lexer and Recursive-Descent Parser
Constraints
- 0 <= input length <= 100000
- Numbers are non-negative integers
- Input programs in tests are syntactically valid
Examples
Input: 'x = 1 + 2 * 3; y = (x + 4);'
Expected Output: {'type': 'Program', 'body': [{'type': 'Assign', 'name': 'x', 'value': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 1}, 'right': {'type': 'BinOp', 'op': '*', 'left': {'type': 'Number', 'value': 2}, 'right': {'type': 'Number', 'value': 3}}}}, {'type': 'Assign', 'name': 'y', 'value': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Identifier', 'name': 'x'}, 'right': {'type': 'Number', 'value': 4}}}]}
Explanation: Multiplication binds tighter than addition.
Input: '1 + 2;'
Expected Output: {'type': 'Program', 'body': [{'type': 'ExprStmt', 'expr': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 1}, 'right': {'type': 'Number', 'value': 2}}}]}
Explanation: A bare expression becomes an ExprStmt.
Input: ''
Expected Output: {'type': 'Program', 'body': []}
Explanation: Edge case: empty program.
Input: 'z = (1 + 2) * (3 + 4);'
Expected Output: {'type': 'Program', 'body': [{'type': 'Assign', 'name': 'z', 'value': {'type': 'BinOp', 'op': '*', 'left': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 1}, 'right': {'type': 'Number', 'value': 2}}, 'right': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 3}, 'right': {'type': 'Number', 'value': 4}}}}]}
Explanation: Parentheses force grouping.
Hints
- Tokenize first so the parser only works with token kinds, not raw characters.
- Use one function per precedence level: expr, term, factor.
Part 6: Parse Statements with Error Reporting and Semicolon-Based Recovery
Constraints
- 0 <= input length <= 100000
- Identifiers use letters, digits, and underscores and do not start with a digit
- Recovery skips to the next semicolon or EOF
Examples
Input: 'let x = 10; print x;'
Expected Output: {'valid_statements': 2, 'errors': []}
Explanation: Both statements are valid.
Input: 'let = 10; print x;'
Expected Output: {'valid_statements': 1, 'errors': [{'position': 4, 'expected': ['IDENT'], 'found': '='}]}
Explanation: The parser recovers after the bad let statement and still parses print x;.
Input: 'print ; let y = 2;'
Expected Output: {'valid_statements': 1, 'errors': [{'position': 6, 'expected': ['IDENT'], 'found': ';'}]}
Explanation: print must be followed by an identifier.
Input: ''
Expected Output: {'valid_statements': 0, 'errors': []}
Explanation: Edge case: empty program.
Input: 'let x = 1'
Expected Output: {'valid_statements': 0, 'errors': [{'position': 9, 'expected': [';'], 'found': 'EOF'}]}
Explanation: Missing semicolon at end of file.
Hints
- Keep token positions from the lexer so the parser can report exact character indices.
- After the first error in a statement, do not try to continue parsing that same statement.
Part 7: Parse Expressions Using Declared Operator Precedence and Associativity
Constraints
- 1 <= number of declared operators <= 20
- Operators in tests do not overlap with identifiers or numbers
- Expressions in tests are syntactically valid
Examples
Input: {'declarations': [['left', ['+', '-']], ['left', ['*', '/']], ['right', ['^']]], 'expression': '1 + 2 * 3'}
Expected Output: ('+', 1, ('*', 2, 3))
Explanation: Multiplication binds tighter than addition.
Input: {'declarations': [['left', ['+', '-']], ['left', ['*', '/']], ['right', ['^']]], 'expression': '2 ^ 3 ^ 2'}
Expected Output: ('^', 2, ('^', 3, 2))
Explanation: Right associativity makes exponentiation nest to the right.
Input: {'declarations': [['right', ['+']]], 'expression': 'a+b+c'}
Expected Output: ('+', 'a', ('+', 'b', 'c'))
Explanation: The declaration forces + to associate to the right.
Input: {'declarations': [['left', ['+', '-']], ['left', ['*', '/']]], 'expression': '(1+2)*3'}
Expected Output: ('*', ('+', 1, 2), 3)
Explanation: Parentheses override precedence.
Input: {'declarations': [['left', ['+']]], 'expression': '42'}
Expected Output: 42
Explanation: Edge case: a single operand needs no operator node.
Hints
- A precedence-climbing parser is a good fit here.
- For left associativity, recurse with precedence + 1; for right associativity, recurse with the same precedence.
Part 8: Choose the Best Parser Under Time and Memory Cost Models
Constraints
- Candidate names are drawn from LL1, LR1, Earley, and PEG
- 0 <= number of workloads <= 1000
- All numeric inputs are non-negative integers
Examples
Input: {'candidates': ['LL1', 'LR1', 'Earley', 'PEG'], 'n': 3, 'p': 5, 'l': 12, 'm': 6, 'token_counts': [10, 20, 5], 'memory_budget': 100, 'ambiguous': False}
Expected Output: {'parser': 'LL1', 'time': 89, 'space': 18}
Explanation: LL1 has the lowest feasible total cost.
Input: {'candidates': ['Earley', 'PEG'], 'n': 2, 'p': 2, 'l': 3, 'm': 2, 'token_counts': [1], 'memory_budget': 10, 'ambiguous': False}
Expected Output: {'parser': 'Earley', 'time': 4, 'space': 1}
Explanation: For this tiny workload, Earley is cheaper than PEG.
Input: {'candidates': ['LL1', 'PEG'], 'n': 4, 'p': 4, 'l': 10, 'm': 5, 'token_counts': [20], 'memory_budget': 15, 'ambiguous': False}
Expected Output: {'parser': None, 'time': None, 'space': None}
Explanation: Edge case: no candidate fits within the memory budget.
Input: {'candidates': ['LL1', 'LR1'], 'n': 2, 'p': 3, 'l': 5, 'm': 4, 'token_counts': [], 'memory_budget': 20, 'ambiguous': False}
Expected Output: {'parser': 'LL1', 'time': 18, 'space': 8}
Explanation: With no programs to parse, only build cost matters.
Hints
- Build cost is paid once, but parse cost is paid for every token count in token_counts.
- For Earley and PEG, working memory depends on the largest input, not the sum.
Part 9: Stream and Parse Statements Incrementally from Input Chunks
Constraints
- 0 <= total input length <= 100000
- Completed statements in tests are syntactically valid
- The remainder may contain an incomplete trailing statement
Examples
Input: ['x = 1 +', ' 2;', 'y = x * 3;']
Expected Output: {'statements': [{'type': 'Assign', 'name': 'x', 'value': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 1}, 'right': {'type': 'Number', 'value': 2}}}, {'type': 'Assign', 'name': 'y', 'value': {'type': 'BinOp', 'op': '*', 'left': {'type': 'Identifier', 'name': 'x'}, 'right': {'type': 'Number', 'value': 3}}}], 'remainder': ''}
Explanation: The first statement is completed across chunk boundaries.
Input: ['(1+2', ')*3;', 'z=4']
Expected Output: {'statements': [{'type': 'ExprStmt', 'expr': {'type': 'BinOp', 'op': '*', 'left': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 1}, 'right': {'type': 'Number', 'value': 2}}, 'right': {'type': 'Number', 'value': 3}}}], 'remainder': 'z=4'}
Explanation: Only the first top-level semicolon-terminated statement is complete.
Input: []
Expected Output: {'statements': [], 'remainder': ''}
Explanation: Edge case: no input chunks.
Input: ['x=1;', 'y=2', '+3;']
Expected Output: {'statements': [{'type': 'Assign', 'name': 'x', 'value': {'type': 'Number', 'value': 1}}, {'type': 'Assign', 'name': 'y', 'value': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 2}, 'right': {'type': 'Number', 'value': 3}}}], 'remainder': ''}
Explanation: The second assignment is incomplete until the final chunk arrives.
Hints
- Track parenthesis depth while reading the stream.
- You only need to parse a statement once its terminating semicolon appears at depth 0.