PracHub
QuestionsCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates competency in compiler and language design, covering grammar analysis, parsing algorithms (LL/LR/GLR/Earley/PEG), lexer and parser implementation, AST construction, error reporting and recovery, operator precedence and associativity, and analysis of time/space trade-offs.

  • Medium
  • OpenAI
  • Coding & Algorithms
  • Software Engineer

Design a parser for a hypothetical language

Company: OpenAI

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Onsite

You are given a set of syntax rules for a hypothetical programming language written in BNF/EBNF. Design data structures and algorithms that: ( 1) parse the grammar into an internal representation (terminals, nonterminals, productions, precedence/associativity); ( 2) detect issues such as left recursion, ambiguity, unreachable or non-productive symbols; ( 3) apply transformations (left-recursion elimination, left factoring) to enable efficient parsing; ( 4) compute FIRST/FOLLOW sets and determine whether the grammar is LL ( 1); if not LL ( 1), justify choosing LR ( 1), GLR, Earley, or PEG parsing and outline construction; ( 5) implement a lexer and parser that build an abstract syntax tree (AST) for input programs; ( 6) provide error reporting and recovery with expected-token hints; ( 7) support operator precedence and associativity declarations; ( 8) analyze time and space complexity for grammar processing and parsing, and discuss trade-offs; ( 9) optionally support incremental or streaming parsing. Provide pseudocode and explain your design choices.

Quick Answer: This question evaluates competency in compiler and language design, covering grammar analysis, parsing algorithms (LL/LR/GLR/Earley/PEG), lexer and parser implementation, AST construction, error reporting and recovery, operator precedence and associativity, and analysis of time/space trade-offs.

Part 1: Parse a Grammar Specification into an Internal Representation

You are given a list of grammar specification lines. Each line is either a precedence declaration such as %left + - or a production such as Expr -> Expr + Term | Term. Parse the specification into an internal representation containing the set of nonterminals, the set of terminals, the flattened production list, and the precedence/associativity table. Symbols are separated by spaces. Epsilon is written as ε. Any symbol that appears on a left-hand side is a nonterminal; every other non-ε symbol appearing on a right-hand side is a terminal.

Constraints

  • 0 <= number of lines <= 200
  • Each production contains exactly one ->
  • Symbols are whitespace-separated and epsilon is represented by 'ε'

Examples

Input: ['%left + -', '%left * /', 'Expr -> Expr + Term | Expr - Term | Term', 'Term -> Term * Factor | Term / Factor | Factor', 'Factor -> num | ( Expr )']

Expected Output: {'nonterminals': ['Expr', 'Factor', 'Term'], 'terminals': ['(', ')', '*', '+', '-', '/', 'num'], 'productions': [{'lhs': 'Expr', 'rhs': ['Expr', '+', 'Term']}, {'lhs': 'Expr', 'rhs': ['Expr', '-', 'Term']}, {'lhs': 'Expr', 'rhs': ['Term']}, {'lhs': 'Term', 'rhs': ['Term', '*', 'Factor']}, {'lhs': 'Term', 'rhs': ['Term', '/', 'Factor']}, {'lhs': 'Term', 'rhs': ['Factor']}, {'lhs': 'Factor', 'rhs': ['num']}, {'lhs': 'Factor', 'rhs': ['(', 'Expr', ')']}], 'precedence': {'+': {'assoc': 'left', 'level': 1}, '-': {'assoc': 'left', 'level': 1}, '*': {'assoc': 'left', 'level': 2}, '/': {'assoc': 'left', 'level': 2}}}

Explanation: This includes both directives and production rules.

Input: []

Expected Output: {'nonterminals': [], 'terminals': [], 'productions': [], 'precedence': {}}

Explanation: Edge case: empty specification.

Input: ['S -> ε']

Expected Output: {'nonterminals': ['S'], 'terminals': [], 'productions': [{'lhs': 'S', 'rhs': ['ε']}], 'precedence': {}}

Explanation: A grammar with only epsilon has no terminals.

Input: ['%right ^', 'Atom -> id | num', 'Pow -> Atom ^ Pow | Atom']

Expected Output: {'nonterminals': ['Atom', 'Pow'], 'terminals': ['^', 'id', 'num'], 'productions': [{'lhs': 'Atom', 'rhs': ['id']}, {'lhs': 'Atom', 'rhs': ['num']}, {'lhs': 'Pow', 'rhs': ['Atom', '^', 'Pow']}, {'lhs': 'Pow', 'rhs': ['Atom']}], 'precedence': {'^': {'assoc': 'right', 'level': 1}}}

Explanation: A single precedence declaration assigns level 1.

Hints

  1. First collect every left-hand side before classifying terminals.
  2. Flatten alternatives after splitting each right-hand side on |.

Part 2: Detect Left Recursion, Unreachable Symbols, Non-Productive Symbols, and FIRST-Set Conflicts

Given a context-free grammar, detect four classes of issues: left-recursive nonterminals, unreachable nonterminals, non-productive nonterminals, and ambiguous nonterminals under a simplified heuristic. The ambiguity heuristic for this problem is: a nonterminal is flagged if two of its productions have overlapping FIRST sets. Epsilon is written as ε.

Constraints

  • 1 <= number of nonterminals <= 100
  • Each production is a list of strings
  • The start symbol always exists in the grammar

Examples

Input: ('S', {'S': [['S', 'a'], ['A'], ['b']], 'A': [['ε'], ['c']], 'B': [['d']]})

Expected Output: {'left_recursive': ['S'], 'unreachable': ['B'], 'non_productive': [], 'ambiguous_nonterminals': ['S']}

Explanation: S is directly left-recursive through S -> S a. B is never reached from the start symbol. All nonterminals are productive. S is ambiguous under the heuristic because FIRST(S a) overlaps with FIRST(A) and FIRST(b).

Input: ('S', {'S': [['A', 'm']], 'A': [['B']], 'B': [['S'], ['t'], ['t', 'u']], 'C': [['D']], 'D': [['C']]})

Expected Output: {'left_recursive': ['A', 'B', 'S'], 'unreachable': ['C', 'D'], 'non_productive': ['C', 'D'], 'ambiguous_nonterminals': ['B']}

Explanation: S, A, and B form a productive left-recursive cycle. C and D form a dead cycle, so they are unreachable and non-productive, but they should not be reported as left-recursive for this problem. B is ambiguous because FIRST(S), FIRST(t), and FIRST(t u) all contain 't'.

Input: ('S', {'S': [['A']], 'A': [['B']], 'B': [['C']], 'C': [['C']]})

Expected Output: {'left_recursive': [], 'unreachable': [], 'non_productive': ['A', 'B', 'C', 'S'], 'ambiguous_nonterminals': []}

Explanation: Every nonterminal is reachable from S, but none can derive any terminal string. The self-cycle at C is a dead cycle, so no nonterminal is reported as left-recursive under the productive-left-recursion rule.

Input: ('S', {'S': [['ε']]})

Expected Output: {'left_recursive': [], 'unreachable': [], 'non_productive': [], 'ambiguous_nonterminals': []}

Explanation: The grammar contains a single productive epsilon production. There is no recursion, no unreachable symbol, and only one production, so there is no FIRST-set conflict.

Hints

  1. Unreachable symbols come from a graph traversal starting at the start symbol.
  2. A productive nonterminal is one that can eventually derive only terminals and/or ε.

Part 3: Transform a Grammar by Eliminating Direct Left Recursion and Left-Factoring

Given a grammar, transform it in two steps: first eliminate direct left recursion for every nonterminal, then repeatedly left-factor any set of productions that share the same first symbol. For this problem, left factoring is only by the first symbol, repeated until no such conflict remains. Use A_rec for the helper created by left-recursion elimination and F1, F2, ... for factoring helpers.

Constraints

  • 1 <= number of nonterminals <= 50
  • Only direct left recursion must be eliminated
  • Factoring is repeated only on common first symbols

Examples

Input: {'A': [['A', 'a'], ['A', 'b'], ['c'], ['d']]}

Expected Output: {'A': [['c', 'A_rec'], ['d', 'A_rec']], 'A_rec': [['a', 'A_rec'], ['b', 'A_rec'], ['ε']]}

Explanation: Direct left recursion is eliminated in the standard way.

Input: {'S': [['a', 'b', 'c'], ['a', 'b', 'd'], ['a', 'e']]}

Expected Output: {'S': [['a', 'F1']], 'F1': [['b', 'F2'], ['e']], 'F2': [['c'], ['d']]}

Explanation: Factoring happens twice: first on a, then inside F1 on b.

Input: {'S': [['S', 'a'], ['b'], ['b', 'c']]}

Expected Output: {'S': [['b', 'F1']], 'S_rec': [['a', 'S_rec'], ['ε']], 'F1': [['S_rec'], ['c', 'S_rec']]}

Explanation: First remove direct left recursion, then factor the two b-starting productions.

Input: {'S': [['a'], ['b']]}

Expected Output: {'S': [['a'], ['b']]}

Explanation: Edge case: no transformation is needed.

Hints

  1. Split A productions into recursive ones of the form A -> A α and non-recursive ones A -> β.
  2. After creating one factoring helper, restart the scan because the new grammar may contain another factorable nonterminal.

Part 4: Compute FIRST and FOLLOW Sets, Check LL(1), and Choose an Alternative Parser

Given a grammar and a start symbol, compute FIRST and FOLLOW sets for every nonterminal, determine whether the grammar is LL(1), and return a deterministic parser choice. Use this choice rule: return LL(1) if the grammar is LL(1); otherwise choose LR(1) if left recursion exists, PEG if there is a FIRST/FIRST conflict without left recursion, and Earley otherwise. Epsilon is written as ε and end-of-input is $.

Constraints

  • 1 <= number of nonterminals <= 100
  • Epsilon is represented by 'ε'
  • All productions are lists of symbols

Examples

Input: {'start': 'E', 'grammar': {'E': [['T', 'Ep']], 'Ep': [['+', 'T', 'Ep'], ['ε']], 'T': [['id'], ['(', 'E', ')']]}}

Expected Output: {'first': {'E': ['(', 'id'], 'Ep': ['+', 'ε'], 'T': ['(', 'id']}, 'follow': {'E': ['$', ')'], 'Ep': ['$', ')'], 'T': ['$', ')', '+']}, 'is_ll1': True, 'alternative': 'LL(1)', 'reason': 'No LL(1) conflicts'}

Explanation: This is a standard LL(1) expression grammar.

Input: {'start': 'E', 'grammar': {'E': [['E', '+', 'T'], ['T']], 'T': [['id']]}}

Expected Output: {'first': {'E': ['id'], 'T': ['id']}, 'follow': {'E': ['$', '+'], 'T': ['$', '+']}, 'is_ll1': False, 'alternative': 'LR(1)', 'reason': 'Left recursion detected'}

Explanation: The grammar is not LL(1) and left recursion suggests an LR-style parser.

Input: {'start': 'S', 'grammar': {'S': [['a', 'A'], ['a', 'B']], 'A': [['c']], 'B': [['d']]}}

Expected Output: {'first': {'A': ['c'], 'B': ['d'], 'S': ['a']}, 'follow': {'A': ['$'], 'B': ['$'], 'S': ['$']}, 'is_ll1': False, 'alternative': 'PEG', 'reason': 'Productions share a prefix token'}

Explanation: There is a FIRST/FIRST conflict but no left recursion.

Input: {'start': 'S', 'grammar': {'S': [['A', 'b']], 'A': [['ε'], ['b']]}}

Expected Output: {'first': {'A': ['b', 'ε'], 'S': ['b']}, 'follow': {'A': ['b'], 'S': ['$']}, 'is_ll1': False, 'alternative': 'Earley', 'reason': 'Nullable/FOLLOW conflict requires a more general parser'}

Explanation: The conflict arises from epsilon and FOLLOW, not from a shared non-epsilon FIRST token.

Hints

  1. For LL(1), check both FIRST/FIRST conflicts and epsilon-related FIRST/FOLLOW conflicts.
  2. FOLLOW of the start symbol always contains $.

Part 5: Build an AST with a Lexer and Recursive-Descent Parser

Implement a lexer and parser for a tiny language of arithmetic expressions and assignments. The grammar is: program -> stmt*; stmt -> IDENT = expr ; | expr ; ; expr -> term ((+|-) term)* ; term -> factor ((*|/) factor)* ; factor -> NUMBER | IDENT | ( expr ). Return the abstract syntax tree.

Constraints

  • 0 <= input length <= 100000
  • Numbers are non-negative integers
  • Input programs in tests are syntactically valid

Examples

Input: 'x = 1 + 2 * 3; y = (x + 4);'

Expected Output: {'type': 'Program', 'body': [{'type': 'Assign', 'name': 'x', 'value': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 1}, 'right': {'type': 'BinOp', 'op': '*', 'left': {'type': 'Number', 'value': 2}, 'right': {'type': 'Number', 'value': 3}}}}, {'type': 'Assign', 'name': 'y', 'value': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Identifier', 'name': 'x'}, 'right': {'type': 'Number', 'value': 4}}}]}

Explanation: Multiplication binds tighter than addition.

Input: '1 + 2;'

Expected Output: {'type': 'Program', 'body': [{'type': 'ExprStmt', 'expr': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 1}, 'right': {'type': 'Number', 'value': 2}}}]}

Explanation: A bare expression becomes an ExprStmt.

Input: ''

Expected Output: {'type': 'Program', 'body': []}

Explanation: Edge case: empty program.

Input: 'z = (1 + 2) * (3 + 4);'

Expected Output: {'type': 'Program', 'body': [{'type': 'Assign', 'name': 'z', 'value': {'type': 'BinOp', 'op': '*', 'left': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 1}, 'right': {'type': 'Number', 'value': 2}}, 'right': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 3}, 'right': {'type': 'Number', 'value': 4}}}}]}

Explanation: Parentheses force grouping.

Hints

  1. Tokenize first so the parser only works with token kinds, not raw characters.
  2. Use one function per precedence level: expr, term, factor.

Part 6: Parse Statements with Error Reporting and Semicolon-Based Recovery

Implement a small parser with error recovery. The language contains two statement forms: let IDENT = NUMBER ; and print IDENT ;. When a syntax error occurs, report the character position, the expected token names, and the found token text. Then recover by skipping tokens until the next semicolon and continue parsing the remaining program.

Constraints

  • 0 <= input length <= 100000
  • Identifiers use letters, digits, and underscores and do not start with a digit
  • Recovery skips to the next semicolon or EOF

Examples

Input: 'let x = 10; print x;'

Expected Output: {'valid_statements': 2, 'errors': []}

Explanation: Both statements are valid.

Input: 'let = 10; print x;'

Expected Output: {'valid_statements': 1, 'errors': [{'position': 4, 'expected': ['IDENT'], 'found': '='}]}

Explanation: The parser recovers after the bad let statement and still parses print x;.

Input: 'print ; let y = 2;'

Expected Output: {'valid_statements': 1, 'errors': [{'position': 6, 'expected': ['IDENT'], 'found': ';'}]}

Explanation: print must be followed by an identifier.

Input: ''

Expected Output: {'valid_statements': 0, 'errors': []}

Explanation: Edge case: empty program.

Input: 'let x = 1'

Expected Output: {'valid_statements': 0, 'errors': [{'position': 9, 'expected': [';'], 'found': 'EOF'}]}

Explanation: Missing semicolon at end of file.

Hints

  1. Keep token positions from the lexer so the parser can report exact character indices.
  2. After the first error in a statement, do not try to continue parsing that same statement.

Part 7: Parse Expressions Using Declared Operator Precedence and Associativity

You are given operator declarations and an expression. Build an AST that respects the declared precedence and associativity. Declarations are provided from lowest precedence to highest precedence. Operands are integers, identifiers, or parenthesized subexpressions. Return the AST as nested tuples of the form (operator, left, right). Leaf nodes are integers or identifier strings.

Constraints

  • 1 <= number of declared operators <= 20
  • Operators in tests do not overlap with identifiers or numbers
  • Expressions in tests are syntactically valid

Examples

Input: {'declarations': [['left', ['+', '-']], ['left', ['*', '/']], ['right', ['^']]], 'expression': '1 + 2 * 3'}

Expected Output: ('+', 1, ('*', 2, 3))

Explanation: Multiplication binds tighter than addition.

Input: {'declarations': [['left', ['+', '-']], ['left', ['*', '/']], ['right', ['^']]], 'expression': '2 ^ 3 ^ 2'}

Expected Output: ('^', 2, ('^', 3, 2))

Explanation: Right associativity makes exponentiation nest to the right.

Input: {'declarations': [['right', ['+']]], 'expression': 'a+b+c'}

Expected Output: ('+', 'a', ('+', 'b', 'c'))

Explanation: The declaration forces + to associate to the right.

Input: {'declarations': [['left', ['+', '-']], ['left', ['*', '/']]], 'expression': '(1+2)*3'}

Expected Output: ('*', ('+', 1, 2), 3)

Explanation: Parentheses override precedence.

Input: {'declarations': [['left', ['+']]], 'expression': '42'}

Expected Output: 42

Explanation: Edge case: a single operand needs no operator node.

Hints

  1. A precedence-climbing parser is a good fit here.
  2. For left associativity, recurse with precedence + 1; for right associativity, recurse with the same precedence.

Part 8: Choose the Best Parser Under Time and Memory Cost Models

You are given grammar statistics, a memory budget, and several program lengths to parse with the same grammar. Compute the estimated total cost of each candidate parser and choose the fastest feasible one. Use these models: LL1 -> build = n*l + n*m, parse each program = t, space = n*m; LR1 -> build = p*l + n*m, parse each program = t, space = p*m + n; Earley -> build = l, parse each program = t^3 if ambiguous else t^2, space = max_t^2; PEG -> build = l, parse each program = l*t, space = l + max_t. Total time is build plus the sum of per-program parse times. If multiple parsers tie on time, choose the one with smaller space; if still tied, choose lexicographically smaller parser name.

Constraints

  • Candidate names are drawn from LL1, LR1, Earley, and PEG
  • 0 <= number of workloads <= 1000
  • All numeric inputs are non-negative integers

Examples

Input: {'candidates': ['LL1', 'LR1', 'Earley', 'PEG'], 'n': 3, 'p': 5, 'l': 12, 'm': 6, 'token_counts': [10, 20, 5], 'memory_budget': 100, 'ambiguous': False}

Expected Output: {'parser': 'LL1', 'time': 89, 'space': 18}

Explanation: LL1 has the lowest feasible total cost.

Input: {'candidates': ['Earley', 'PEG'], 'n': 2, 'p': 2, 'l': 3, 'm': 2, 'token_counts': [1], 'memory_budget': 10, 'ambiguous': False}

Expected Output: {'parser': 'Earley', 'time': 4, 'space': 1}

Explanation: For this tiny workload, Earley is cheaper than PEG.

Input: {'candidates': ['LL1', 'PEG'], 'n': 4, 'p': 4, 'l': 10, 'm': 5, 'token_counts': [20], 'memory_budget': 15, 'ambiguous': False}

Expected Output: {'parser': None, 'time': None, 'space': None}

Explanation: Edge case: no candidate fits within the memory budget.

Input: {'candidates': ['LL1', 'LR1'], 'n': 2, 'p': 3, 'l': 5, 'm': 4, 'token_counts': [], 'memory_budget': 20, 'ambiguous': False}

Expected Output: {'parser': 'LL1', 'time': 18, 'space': 8}

Explanation: With no programs to parse, only build cost matters.

Hints

  1. Build cost is paid once, but parse cost is paid for every token count in token_counts.
  2. For Earley and PEG, working memory depends on the largest input, not the sum.

Part 9: Stream and Parse Statements Incrementally from Input Chunks

Implement streaming parsing for the tiny assignment/expression language from Part 5. Input arrives as a list of text chunks. As soon as a complete top-level semicolon-terminated statement is available, parse it and emit its AST. A semicolon only ends a statement when it is not inside parentheses. Return all completed statement ASTs plus the unconsumed remainder.

Constraints

  • 0 <= total input length <= 100000
  • Completed statements in tests are syntactically valid
  • The remainder may contain an incomplete trailing statement

Examples

Input: ['x = 1 +', ' 2;', 'y = x * 3;']

Expected Output: {'statements': [{'type': 'Assign', 'name': 'x', 'value': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 1}, 'right': {'type': 'Number', 'value': 2}}}, {'type': 'Assign', 'name': 'y', 'value': {'type': 'BinOp', 'op': '*', 'left': {'type': 'Identifier', 'name': 'x'}, 'right': {'type': 'Number', 'value': 3}}}], 'remainder': ''}

Explanation: The first statement is completed across chunk boundaries.

Input: ['(1+2', ')*3;', 'z=4']

Expected Output: {'statements': [{'type': 'ExprStmt', 'expr': {'type': 'BinOp', 'op': '*', 'left': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 1}, 'right': {'type': 'Number', 'value': 2}}, 'right': {'type': 'Number', 'value': 3}}}], 'remainder': 'z=4'}

Explanation: Only the first top-level semicolon-terminated statement is complete.

Input: []

Expected Output: {'statements': [], 'remainder': ''}

Explanation: Edge case: no input chunks.

Input: ['x=1;', 'y=2', '+3;']

Expected Output: {'statements': [{'type': 'Assign', 'name': 'x', 'value': {'type': 'Number', 'value': 1}}, {'type': 'Assign', 'name': 'y', 'value': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 2}, 'right': {'type': 'Number', 'value': 3}}}], 'remainder': ''}

Explanation: The second assignment is incomplete until the final chunk arrives.

Hints

  1. Track parenthesis depth while reading the stream.
  2. You only need to parse a statement once its terminating semicolon appears at depth 0.
Last updated: May 5, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Infection Spread Simulation with Death Threshold - OpenAI (medium)
  • Spreading Contagion on a Grid - OpenAI (medium)
  • Streaming Entropy with Numerical Stability - OpenAI (hard)
  • Implement a Distributed Rate Limiter - OpenAI (medium)
  • Compute Plant Infection Time - OpenAI (medium)