Design a parser for a hypothetical language
Company: OpenAI
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
Quick Answer: This question evaluates competency in compiler and language design, covering grammar analysis, parsing algorithms (LL/LR/GLR/Earley/PEG), lexer and parser implementation, AST construction, error reporting and recovery, operator precedence and associativity, and analysis of time/space trade-offs.
Part 1: Parse a Grammar Specification into an Internal Representation
Constraints
- 0 <= number of lines <= 200
- Each production contains exactly one ->
- Symbols are whitespace-separated and epsilon is represented by 'ε'
Examples
Input: ['%left + -', '%left * /', 'Expr -> Expr + Term | Expr - Term | Term', 'Term -> Term * Factor | Term / Factor | Factor', 'Factor -> num | ( Expr )']
Expected Output: {'nonterminals': ['Expr', 'Factor', 'Term'], 'terminals': ['(', ')', '*', '+', '-', '/', 'num'], 'productions': [{'lhs': 'Expr', 'rhs': ['Expr', '+', 'Term']}, {'lhs': 'Expr', 'rhs': ['Expr', '-', 'Term']}, {'lhs': 'Expr', 'rhs': ['Term']}, {'lhs': 'Term', 'rhs': ['Term', '*', 'Factor']}, {'lhs': 'Term', 'rhs': ['Term', '/', 'Factor']}, {'lhs': 'Term', 'rhs': ['Factor']}, {'lhs': 'Factor', 'rhs': ['num']}, {'lhs': 'Factor', 'rhs': ['(', 'Expr', ')']}], 'precedence': {'+': {'assoc': 'left', 'level': 1}, '-': {'assoc': 'left', 'level': 1}, '*': {'assoc': 'left', 'level': 2}, '/': {'assoc': 'left', 'level': 2}}}
Explanation: The specification mixes %left precedence directives with production rules; each |-separated alternative becomes its own production.
Input: []
Expected Output: {'nonterminals': [], 'terminals': [], 'productions': [], 'precedence': {}}
Explanation: Edge case: empty specification.
Input: ['S -> ε']
Expected Output: {'nonterminals': ['S'], 'terminals': [], 'productions': [{'lhs': 'S', 'rhs': ['ε']}], 'precedence': {}}
Explanation: A grammar with only epsilon has no terminals.
Input: ['%right ^', 'Atom -> id | num', 'Pow -> Atom ^ Pow | Atom']
Expected Output: {'nonterminals': ['Atom', 'Pow'], 'terminals': ['^', 'id', 'num'], 'productions': [{'lhs': 'Atom', 'rhs': ['id']}, {'lhs': 'Atom', 'rhs': ['num']}, {'lhs': 'Pow', 'rhs': ['Atom', '^', 'Pow']}, {'lhs': 'Pow', 'rhs': ['Atom']}], 'precedence': {'^': {'assoc': 'right', 'level': 1}}}
Explanation: A single precedence declaration assigns level 1.
Solution
def solution(lines):
    precedence = {}
    parsed_rules = []
    nonterminals = set()
    level = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith('%'):
            # Precedence directive, e.g. '%left + -'; each directive
            # line introduces the next (higher) precedence level.
            parts = line.split()
            assoc = parts[0][1:]
            level += 1
            for tok in parts[1:]:
                precedence[tok] = {'assoc': assoc, 'level': level}
        else:
            if '->' not in line:
                continue
            lhs, rhs = line.split('->', 1)
            lhs = lhs.strip()
            nonterminals.add(lhs)
            parsed_rules.append((lhs, rhs.strip()))
    # Expand alternatives only after every left-hand side has been
    # collected, so symbols can be classified correctly below.
    productions = []
    for lhs, rhs_text in parsed_rules:
        for alt in rhs_text.split('|'):
            alt = alt.strip()
            rhs = alt.split() if alt else ['ε']
            productions.append({'lhs': lhs, 'rhs': rhs})
    terminals = set()
    for prod in productions:
        for sym in prod['rhs']:
            if sym != 'ε' and sym not in nonterminals:
                terminals.add(sym)
    return {
        'nonterminals': sorted(nonterminals),
        'terminals': sorted(terminals),
        'productions': productions,
        'precedence': precedence
    }

Time complexity: O(L), where L is the total number of symbols/tokens across all lines. Space complexity: O(L).
Hints
- First collect every left-hand side before classifying terminals.
- Flatten alternatives after splitting each right-hand side on |.
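A quick sanity check, assuming the Part 1 solution above is in scope, using the epsilon-only example from this section:

spec = ['S -> ε']
result = solution(spec)
assert result['nonterminals'] == ['S']
assert result['terminals'] == []   # ε is not a terminal
assert result['productions'] == [{'lhs': 'S', 'rhs': ['ε']}]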
Part 2: Detect Left Recursion, Unreachable Symbols, Non-Productive Symbols, and FIRST-Set Conflicts
Constraints
- 1 <= number of nonterminals <= 100
- Each production is a list of strings
- The start symbol always exists in the grammar
Examples
Input: ('S', {'S': [['S', 'a'], ['A'], ['b']], 'A': [['ε'], ['c']], 'B': [['d']]})
Expected Output: {'left_recursive': ['S'], 'unreachable': ['B'], 'non_productive': [], 'ambiguous_nonterminals': ['S']}
Explanation: S is directly left-recursive through S -> S a. B is never reached from the start symbol. All nonterminals are productive. S is ambiguous under the heuristic because FIRST(S a) overlaps with FIRST(A) and FIRST(b).
Input: ('S', {'S': [['A', 'm']], 'A': [['B']], 'B': [['S'], ['t'], ['t', 'u']], 'C': [['D']], 'D': [['C']]})
Expected Output: {'left_recursive': ['A', 'B', 'S'], 'unreachable': ['C', 'D'], 'non_productive': ['C', 'D'], 'ambiguous_nonterminals': ['B']}
Explanation: S, A, and B form a productive left-recursive cycle. C and D form a dead cycle, so they are unreachable and non-productive, but they should not be reported as left-recursive for this problem. B is ambiguous because FIRST(S), FIRST(t), and FIRST(t u) all contain 't'.
Input: ('S', {'S': [['A']], 'A': [['B']], 'B': [['C']], 'C': [['C']]})
Expected Output: {'left_recursive': [], 'unreachable': [], 'non_productive': ['A', 'B', 'C', 'S'], 'ambiguous_nonterminals': []}
Explanation: Every nonterminal is reachable from S, but none can derive any terminal string. The self-cycle at C is a dead cycle, so no nonterminal is reported as left-recursive under the productive-left-recursion rule.
Input: ('S', {'S': [['ε']]})
Expected Output: {'left_recursive': [], 'unreachable': [], 'non_productive': [], 'ambiguous_nonterminals': []}
Explanation: The grammar contains a single productive epsilon production. There is no recursion, no unreachable symbol, and only one production, so there is no FIRST-set conflict.
Solution
def solution(start_symbol, grammar):
    EPS = 'ε'
    nonterminals = set(grammar.keys())

    def seq_is_nullable(prod, nullable):
        if not prod:
            return True
        for sym in prod:
            if sym == EPS:
                continue
            if sym in nonterminals and sym in nullable:
                continue
            return False
        return True

    def seq_is_productive(prod, productive):
        if not prod:
            return True
        for sym in prod:
            if sym == EPS:
                continue
            if sym in nonterminals and sym not in productive:
                return False
        return True

    # Fixed point: a nonterminal is nullable if some production is
    # entirely nullable.
    nullable = set()
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            if nt in nullable:
                continue
            for prod in productions:
                if seq_is_nullable(prod, nullable):
                    nullable.add(nt)
                    changed = True
                    break

    # Fixed point: a nonterminal is productive if some production uses
    # only terminals, ε, and already-productive nonterminals.
    productive = set()
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            if nt in productive:
                continue
            for prod in productions:
                if seq_is_productive(prod, productive):
                    productive.add(nt)
                    changed = True
                    break

    first = {nt: set() for nt in nonterminals}

    def first_of_sequence(prod):
        if not prod:
            return {EPS}
        result = set()
        all_nullable = True
        for sym in prod:
            if sym == EPS:
                continue
            if sym in nonterminals:
                result |= (first[sym] - {EPS})
                if sym in nullable:
                    continue
                all_nullable = False
                break
            else:
                result.add(sym)
                all_nullable = False
                break
        if all_nullable:
            result.add(EPS)
        return result

    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for prod in productions:
                seq_first = first_of_sequence(prod)
                if not seq_first.issubset(first[nt]):
                    first[nt] |= seq_first
                    changed = True

    # Reachability: DFS over nonterminals starting at the start symbol.
    reachable = set()
    stack = [start_symbol] if start_symbol in nonterminals else []
    while stack:
        nt = stack.pop()
        if nt in reachable:
            continue
        reachable.add(nt)
        for prod in grammar.get(nt, []):
            for sym in prod:
                if sym in nonterminals and sym not in reachable:
                    stack.append(sym)

    unreachable = sorted(nonterminals - reachable)
    non_productive = sorted(nonterminals - productive)

    # FIRST-set conflict heuristic: two alternatives of the same
    # nonterminal whose FIRST sets overlap.
    ambiguous = []
    for nt, productions in grammar.items():
        seen = set()
        conflict = False
        for prod in productions:
            prod_first = first_of_sequence(prod)
            if prod_first & seen:
                conflict = True
                break
            seen |= prod_first
        if conflict:
            ambiguous.append(nt)
    ambiguous.sort()

    # Left-recursion graph restricted to productive nonterminals:
    # an edge nt -> sym exists when sym can appear leftmost in some
    # production of nt (possibly behind a nullable prefix).
    graph = {nt: set() for nt in productive}
    for nt in productive:
        for prod in grammar[nt]:
            if not prod:
                continue
            for sym in prod:
                if sym == EPS:
                    continue
                if sym in nonterminals:
                    if sym in productive:
                        graph[nt].add(sym)
                    if sym in nullable:
                        continue
                    break
                else:
                    break

    def is_left_recursive(start):
        stack = list(graph.get(start, ()))
        visited = set()
        while stack:
            cur = stack.pop()
            if cur == start:
                return True
            if cur in visited:
                continue
            visited.add(cur)
            stack.extend(graph.get(cur, ()))
        return False

    left_recursive = sorted(nt for nt in productive if is_left_recursive(nt))
    return {
        'left_recursive': left_recursive,
        'unreachable': unreachable,
        'non_productive': non_productive,
        'ambiguous_nonterminals': ambiguous
    }

Time complexity: O(N^3) worst-case, where N is the total size of the grammar representation. Space complexity: O(N^2) worst-case.
Hints
- Unreachable symbols come from a graph traversal starting at the start symbol.
- A productive nonterminal is one that can eventually derive only terminals and/or ε.
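To illustrate the second hint, here is a minimal standalone sketch of the productive-set fixed point on the dead-cycle grammar from the third example; the names grammar and productive are local to this snippet:

grammar = {'S': [['A']], 'A': [['B']], 'B': [['C']], 'C': [['C']]}
productive = set()
changed = True
while changed:
    changed = False
    for nt, prods in grammar.items():
        if nt in productive:
            continue
        # a production counts if every symbol is a terminal, 'ε',
        # or an already-productive nonterminal
        if any(all(s == 'ε' or s not in grammar or s in productive
                   for s in prod)
               for prod in prods):
            productive.add(nt)
            changed = True
assert productive == set()  # C -> C never fires, so nothing becomes productive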
Part 3: Transform a Grammar by Eliminating Direct Left Recursion and Left-Factoring
Constraints
- 1 <= number of nonterminals <= 50
- Only direct left recursion must be eliminated
- Factoring is repeated only on common first symbols
Examples
Input: {'A': [['A', 'a'], ['A', 'b'], ['c'], ['d']]}
Expected Output: {'A': [['c', 'A_rec'], ['d', 'A_rec']], 'A_rec': [['a', 'A_rec'], ['b', 'A_rec'], ['ε']]}
Explanation: Direct left recursion is eliminated in the standard way.
Input: {'S': [['a', 'b', 'c'], ['a', 'b', 'd'], ['a', 'e']]}
Expected Output: {'S': [['a', 'F1']], 'F1': [['b', 'F2'], ['e']], 'F2': [['c'], ['d']]}
Explanation: Factoring happens twice: first on a, then inside F1 on b.
Input: {'S': [['S', 'a'], ['b'], ['b', 'c']]}
Expected Output: {'S': [['b', 'F1']], 'S_rec': [['a', 'S_rec'], ['ε']], 'F1': [['S_rec'], ['c', 'S_rec']]}
Explanation: First remove direct left recursion, then factor the two b-starting productions.
Input: {'S': [['a'], ['b']]}
Expected Output: {'S': [['a'], ['b']]}
Explanation: Edge case: no transformation is needed.
Solution
def solution(grammar):
    g = {k: [prod[:] for prod in v] for k, v in grammar.items()}
    transformed = {}

    # Step 1: eliminate direct left recursion. A -> A α | β becomes
    # A -> β A_rec and A_rec -> α A_rec | ε.
    for nt in sorted(g):
        prods = g[nt]
        recursive = [prod[1:] for prod in prods if prod and prod[0] == nt]
        non_recursive = [prod[:] for prod in prods if not (prod and prod[0] == nt)]
        if recursive:
            helper = nt + '_rec'
            betas = non_recursive if non_recursive else [['ε']]
            transformed[nt] = []
            for beta in betas:
                seq = [] if beta == ['ε'] else beta[:]
                seq.append(helper)
                transformed[nt].append(seq)
            transformed[helper] = []
            for alpha in recursive:
                seq = [] if alpha == ['ε'] else alpha[:]
                seq.append(helper)
                transformed[helper].append(seq)
            transformed[helper].append(['ε'])
        else:
            transformed[nt] = [prod[:] for prod in prods]

    # Step 2: repeatedly left-factor on a common first symbol, restarting
    # the scan after each new helper since it may itself be factorable.
    counter = 1
    while True:
        changed = False
        for nt in sorted(transformed):
            prods = transformed[nt]
            groups = {}
            for idx, prod in enumerate(prods):
                key = None if prod == ['ε'] or not prod else prod[0]
                groups.setdefault(key, []).append((idx, prod))
            factorable = sorted(sym for sym, items in groups.items()
                                if sym is not None and len(items) > 1)
            if not factorable:
                continue
            sym = factorable[0]
            helper = 'F' + str(counter)
            counter += 1
            used = {idx for idx, _ in groups[sym]}
            new_nt_prods = []
            helper_prods = []
            inserted = False
            for idx, prod in enumerate(prods):
                if idx in used:
                    if not inserted:
                        new_nt_prods.append([sym, helper])
                        inserted = True
                    rest = prod[1:]
                    helper_prods.append(rest if rest else ['ε'])
                else:
                    new_nt_prods.append(prod)
            transformed[nt] = new_nt_prods
            transformed[helper] = helper_prods
            changed = True
            break
        if not changed:
            break
    return transformed

Time complexity: O(R^2 + L) in the worst case, where R is the number of productions and L is the total RHS length. Space complexity: O(R + L).
Hints
- Split A productions into recursive ones of the form A -> A α and non-recursive ones A -> β.
- After creating one factoring helper, restart the scan because the new grammar may contain another factorable nonterminal.
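As a sanity check of the A -> A α / A -> β split described in the first hint, assuming the Part 3 solution above is in scope:

out = solution({'A': [['A', 'a'], ['A', 'b'], ['c'], ['d']]})
assert out['A'] == [['c', 'A_rec'], ['d', 'A_rec']]
assert out['A_rec'] == [['a', 'A_rec'], ['b', 'A_rec'], ['ε']]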
Part 4: Compute FIRST and FOLLOW Sets, Check LL(1), and Choose an Alternative Parser
Constraints
- 1 <= number of nonterminals <= 100
- Epsilon is represented by 'ε'
- All productions are lists of symbols
Examples
Input: {'start': 'E', 'grammar': {'E': [['T', 'Ep']], 'Ep': [['+', 'T', 'Ep'], ['ε']], 'T': [['id'], ['(', 'E', ')']]}}
Expected Output: {'first': {'E': ['(', 'id'], 'Ep': ['+', 'ε'], 'T': ['(', 'id']}, 'follow': {'E': ['$', ')'], 'Ep': ['$', ')'], 'T': ['$', ')', '+']}, 'is_ll1': True, 'alternative': 'LL(1)', 'reason': 'No LL(1) conflicts'}
Explanation: This is a standard LL(1) expression grammar.
Input: {'start': 'E', 'grammar': {'E': [['E', '+', 'T'], ['T']], 'T': [['id']]}}
Expected Output: {'first': {'E': ['id'], 'T': ['id']}, 'follow': {'E': ['$', '+'], 'T': ['$', '+']}, 'is_ll1': False, 'alternative': 'LR(1)', 'reason': 'Left recursion detected'}
Explanation: The grammar is not LL(1) and left recursion suggests an LR-style parser.
Input: {'start': 'S', 'grammar': {'S': [['a', 'A'], ['a', 'B']], 'A': [['c']], 'B': [['d']]}}
Expected Output: {'first': {'A': ['c'], 'B': ['d'], 'S': ['a']}, 'follow': {'A': ['$'], 'B': ['$'], 'S': ['$']}, 'is_ll1': False, 'alternative': 'PEG', 'reason': 'Productions share a prefix token'}
Explanation: There is a FIRST/FIRST conflict but no left recursion.
Input: {'start': 'S', 'grammar': {'S': [['A', 'b']], 'A': [['ε'], ['b']]}}
Expected Output: {'first': {'A': ['b', 'ε'], 'S': ['b']}, 'follow': {'A': ['b'], 'S': ['$']}, 'is_ll1': False, 'alternative': 'Earley', 'reason': 'Nullable/FOLLOW conflict requires a more general parser'}
Explanation: The conflict arises from epsilon and FOLLOW, not from a shared non-epsilon FIRST token.
Solution
def solution(data):
    grammar = {k: [prod[:] for prod in v] for k, v in data['grammar'].items()}
    start = data['start']
    nonterminals = set(grammar)

    # FIRST sets via a fixed-point iteration.
    first = {nt: set() for nt in nonterminals}
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                if prod == ['ε'] or not prod:
                    if 'ε' not in first[nt]:
                        first[nt].add('ε')
                        changed = True
                    continue
                all_nullable = True
                for sym in prod:
                    if sym == 'ε':
                        if 'ε' not in first[nt]:
                            first[nt].add('ε')
                            changed = True
                        break
                    if sym not in nonterminals:
                        if sym not in first[nt]:
                            first[nt].add(sym)
                            changed = True
                        all_nullable = False
                        break
                    before = len(first[nt])
                    first[nt] |= (first[sym] - {'ε'})
                    if len(first[nt]) != before:
                        changed = True
                    if 'ε' in first[sym]:
                        continue
                    all_nullable = False
                    break
                else:
                    # Loop ran to completion: every symbol was nullable.
                    if all_nullable and 'ε' not in first[nt]:
                        first[nt].add('ε')
                        changed = True

    def first_of_seq(seq):
        if seq == ['ε'] or not seq:
            return {'ε'}
        out = set()
        all_nullable = True
        for sym in seq:
            if sym == 'ε':
                out.add('ε')
                return out
            if sym not in nonterminals:
                out.add(sym)
                return out
            out |= (first[sym] - {'ε'})
            if 'ε' in first[sym]:
                continue
            all_nullable = False
            break
        if all_nullable:
            out.add('ε')
        return out

    # FOLLOW sets via a fixed-point iteration.
    follow = {nt: set() for nt in nonterminals}
    follow[start].add('$')
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                for i, sym in enumerate(prod):
                    if sym not in nonterminals:
                        continue
                    beta = prod[i + 1:]
                    fb = first_of_seq(beta)
                    before = len(follow[sym])
                    follow[sym] |= (fb - {'ε'})
                    if 'ε' in fb or not beta:
                        follow[sym] |= follow[nt]
                    if len(follow[sym]) != before:
                        changed = True

    # Detect left recursion through possibly-nullable prefixes.
    nullable = {nt for nt in nonterminals if 'ε' in first[nt]}
    graph = {nt: set() for nt in nonterminals}
    for nt, prods in grammar.items():
        for prod in prods:
            prefix_nullable = True
            for sym in prod:
                if sym == 'ε':
                    continue
                if sym in nonterminals:
                    if prefix_nullable:
                        graph[nt].add(sym)
                    if sym in nullable:
                        continue
                    break
                break

    left_recursive = False
    for nt in nonterminals:
        stack = list(graph[nt])
        seen = set()
        while stack:
            cur = stack.pop()
            if cur == nt:
                left_recursive = True
                break
            if cur in seen:
                continue
            seen.add(cur)
            stack.extend(graph[cur] - seen)
        if left_recursive:
            break

    # LL(1) check: FIRST/FIRST conflicts and epsilon-related
    # FIRST/FOLLOW conflicts between alternatives of the same nonterminal.
    is_ll1 = True
    first_first_conflict = False
    for nt, prods in grammar.items():
        prod_firsts = [first_of_seq(prod) for prod in prods]
        for i in range(len(prod_firsts)):
            for j in range(i + 1, len(prod_firsts)):
                if (prod_firsts[i] - {'ε'}) & (prod_firsts[j] - {'ε'}):
                    is_ll1 = False
                    first_first_conflict = True
                if 'ε' in prod_firsts[i] and 'ε' in prod_firsts[j]:
                    is_ll1 = False
                if 'ε' in prod_firsts[i] and (follow[nt] & (prod_firsts[j] - {'ε'})):
                    is_ll1 = False
                if 'ε' in prod_firsts[j] and (follow[nt] & (prod_firsts[i] - {'ε'})):
                    is_ll1 = False

    if is_ll1:
        alternative = 'LL(1)'
        reason = 'No LL(1) conflicts'
    elif left_recursive:
        alternative = 'LR(1)'
        reason = 'Left recursion detected'
    elif first_first_conflict:
        alternative = 'PEG'
        reason = 'Productions share a prefix token'
    else:
        alternative = 'Earley'
        reason = 'Nullable/FOLLOW conflict requires a more general parser'

    return {
        'first': {nt: sorted(first[nt]) for nt in sorted(grammar)},
        'follow': {nt: sorted(follow[nt]) for nt in sorted(grammar)},
        'is_ll1': is_ll1,
        'alternative': alternative,
        'reason': reason
    }

Time complexity: O(N * L + N^2), where N is the number of nonterminals and L is the total RHS length. Space complexity: O(N + L).
Hints
- For LL(1), check both FIRST/FIRST conflicts and epsilon-related FIRST/FOLLOW conflicts.
- FOLLOW of the start symbol always contains $.
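A quick check of the epsilon-related conflict from the fourth example, assuming the Part 4 solution above is in scope: FIRST(A) contains ε and FOLLOW(A) overlaps the FIRST set of the other alternative, so the grammar fails the LL(1) test without being left-recursive.

out = solution({'start': 'S', 'grammar': {'S': [['A', 'b']], 'A': [['ε'], ['b']]}})
assert out['is_ll1'] is False
assert out['alternative'] == 'Earley'
assert out['follow']['A'] == ['b']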
Part 5: Build an AST with a Lexer and Recursive-Descent Parser
Constraints
- 0 <= input length <= 100000
- Numbers are non-negative integers
- Input programs in tests are syntactically valid
Examples
Input: 'x = 1 + 2 * 3; y = (x + 4);'
Expected Output: {'type': 'Program', 'body': [{'type': 'Assign', 'name': 'x', 'value': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 1}, 'right': {'type': 'BinOp', 'op': '*', 'left': {'type': 'Number', 'value': 2}, 'right': {'type': 'Number', 'value': 3}}}}, {'type': 'Assign', 'name': 'y', 'value': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Identifier', 'name': 'x'}, 'right': {'type': 'Number', 'value': 4}}}]}
Explanation: Multiplication binds tighter than addition.
Input: '1 + 2;'
Expected Output: {'type': 'Program', 'body': [{'type': 'ExprStmt', 'expr': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 1}, 'right': {'type': 'Number', 'value': 2}}}]}
Explanation: A bare expression becomes an ExprStmt.
Input: ''
Expected Output: {'type': 'Program', 'body': []}
Explanation: Edge case: empty program.
Input: 'z = (1 + 2) * (3 + 4);'
Expected Output: {'type': 'Program', 'body': [{'type': 'Assign', 'name': 'z', 'value': {'type': 'BinOp', 'op': '*', 'left': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 1}, 'right': {'type': 'Number', 'value': 2}}, 'right': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 3}, 'right': {'type': 'Number', 'value': 4}}}}]}
Explanation: Parentheses force grouping.
Solution
def solution(program):
    def lex(text):
        tokens = []
        i = 0
        while i < len(text):
            c = text[i]
            if c.isspace():
                i += 1
            elif c.isdigit():
                j = i
                while j < len(text) and text[j].isdigit():
                    j += 1
                tokens.append(('NUMBER', int(text[i:j])))
                i = j
            elif c.isalpha() or c == '_':
                j = i
                while j < len(text) and (text[j].isalnum() or text[j] == '_'):
                    j += 1
                tokens.append(('IDENT', text[i:j]))
                i = j
            elif c in '+-*/=();':
                tokens.append((c, c))
                i += 1
            else:
                raise ValueError('invalid character')
        tokens.append(('EOF', 'EOF'))
        return tokens

    tokens = lex(program)
    i = 0

    def peek(k=0):
        return tokens[i + k][0]

    def eat(expected=None):
        nonlocal i
        tok = tokens[i]
        if expected is not None and tok[0] != expected:
            raise ValueError('syntax error')
        i += 1
        return tok

    # One parse function per precedence level: expr handles + and -,
    # term handles * and /, factor handles atoms and parentheses.
    def parse_factor():
        if peek() == 'NUMBER':
            return {'type': 'Number', 'value': eat('NUMBER')[1]}
        if peek() == 'IDENT':
            return {'type': 'Identifier', 'name': eat('IDENT')[1]}
        if peek() == '(':
            eat('(')
            node = parse_expr()
            eat(')')
            return node
        raise ValueError('expected factor')

    def parse_term():
        node = parse_factor()
        while peek() in ('*', '/'):
            op = eat()[0]
            right = parse_factor()
            node = {'type': 'BinOp', 'op': op, 'left': node, 'right': right}
        return node

    def parse_expr():
        node = parse_term()
        while peek() in ('+', '-'):
            op = eat()[0]
            right = parse_term()
            node = {'type': 'BinOp', 'op': op, 'left': node, 'right': right}
        return node

    def parse_stmt():
        if peek() == 'IDENT' and peek(1) == '=':
            name = eat('IDENT')[1]
            eat('=')
            value = parse_expr()
            eat(';')
            return {'type': 'Assign', 'name': name, 'value': value}
        expr = parse_expr()
        eat(';')
        return {'type': 'ExprStmt', 'expr': expr}

    body = []
    while peek() != 'EOF':
        body.append(parse_stmt())
    return {'type': 'Program', 'body': body}

Time complexity: O(n), where n is the length of the input. Space complexity: O(n).
Hints
- Tokenize first so the parser only works with token kinds, not raw characters.
- Use one function per precedence level: expr, term, factor.
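A minimal usage check, assuming the Part 5 solution above is in scope; operator precedence falls out of the expr/term/factor layering:

ast = solution('1 + 2;')
assert ast == {'type': 'Program', 'body': [{'type': 'ExprStmt', 'expr': {
    'type': 'BinOp', 'op': '+',
    'left': {'type': 'Number', 'value': 1},
    'right': {'type': 'Number', 'value': 2}}}]}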
Part 6: Parse Statements with Error Reporting and Semicolon-Based Recovery
Constraints
- 0 <= input length <= 100000
- Identifiers use letters, digits, and underscores and do not start with a digit
- Recovery skips to the next semicolon or EOF
Examples
Input: 'let x = 10; print x;'
Expected Output: {'valid_statements': 2, 'errors': []}
Explanation: Both statements are valid.
Input: 'let = 10; print x;'
Expected Output: {'valid_statements': 1, 'errors': [{'position': 4, 'expected': ['IDENT'], 'found': '='}]}
Explanation: The parser recovers after the bad let statement and still parses print x;.
Input: 'print ; let y = 2;'
Expected Output: {'valid_statements': 1, 'errors': [{'position': 6, 'expected': ['IDENT'], 'found': ';'}]}
Explanation: print must be followed by an identifier.
Input: ''
Expected Output: {'valid_statements': 0, 'errors': []}
Explanation: Edge case: empty program.
Input: 'let x = 1'
Expected Output: {'valid_statements': 0, 'errors': [{'position': 9, 'expected': [';'], 'found': 'EOF'}]}
Explanation: Missing semicolon at end of file.
Solution
def solution(program):
    def lex(text):
        tokens = []
        i = 0
        while i < len(text):
            c = text[i]
            if c.isspace():
                i += 1
            elif c.isalpha() or c == '_':
                j = i
                while j < len(text) and (text[j].isalnum() or text[j] == '_'):
                    j += 1
                word = text[i:j]
                typ = 'IDENT'
                if word == 'let':
                    typ = 'LET'
                elif word == 'print':
                    typ = 'PRINT'
                tokens.append({'type': typ, 'value': word, 'pos': i})
                i = j
            elif c.isdigit():
                j = i
                while j < len(text) and text[j].isdigit():
                    j += 1
                tokens.append({'type': 'NUMBER', 'value': text[i:j], 'pos': i})
                i = j
            elif c == '=':
                tokens.append({'type': 'EQ', 'value': '=', 'pos': i})
                i += 1
            elif c == ';':
                tokens.append({'type': 'SEMI', 'value': ';', 'pos': i})
                i += 1
            else:
                tokens.append({'type': 'INVALID', 'value': c, 'pos': i})
                i += 1
        tokens.append({'type': 'EOF', 'value': 'EOF', 'pos': len(text)})
        return tokens

    tokens = lex(program)
    i = 0
    errors = []
    valid = 0

    def cur():
        return tokens[i]

    def advance():
        nonlocal i
        i += 1

    def report(expected):
        token = cur()
        found = token['value'] if token['type'] != 'EOF' else 'EOF'
        errors.append({'position': token['pos'], 'expected': sorted(expected), 'found': found})

    def recover():
        # Panic-mode recovery: skip to the next semicolon (consuming it) or EOF.
        nonlocal i
        while cur()['type'] not in ('SEMI', 'EOF'):
            i += 1
        if cur()['type'] == 'SEMI':
            i += 1

    while cur()['type'] != 'EOF':
        if cur()['type'] == 'LET':
            advance()
            if cur()['type'] != 'IDENT':
                report(['IDENT'])
                recover()
                continue
            advance()
            if cur()['type'] != 'EQ':
                report(['='])
                recover()
                continue
            advance()
            if cur()['type'] != 'NUMBER':
                report(['NUMBER'])
                recover()
                continue
            advance()
            if cur()['type'] != 'SEMI':
                report([';'])
                recover()
                continue
            advance()
            valid += 1
        elif cur()['type'] == 'PRINT':
            advance()
            if cur()['type'] != 'IDENT':
                report(['IDENT'])
                recover()
                continue
            advance()
            if cur()['type'] != 'SEMI':
                report([';'])
                recover()
                continue
            advance()
            valid += 1
        else:
            report(['let', 'print'])
            recover()
    return {'valid_statements': valid, 'errors': errors}

Time complexity: O(n), where n is the length of the input. Space complexity: O(n).
Hints
- Keep token positions from the lexer so the parser can report exact character indices.
- After the first error in a statement, do not try to continue parsing that same statement.
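A quick check that recovery resumes at the next statement, assuming the Part 6 solution above is in scope:

out = solution('let = 10; print x;')
assert out['valid_statements'] == 1   # print x; still parses
assert out['errors'] == [{'position': 4, 'expected': ['IDENT'], 'found': '='}]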
Part 7: Parse Expressions Using Declared Operator Precedence and Associativity
Constraints
- 1 <= number of declared operators <= 20
- Operators in tests do not overlap with identifiers or numbers
- Expressions in tests are syntactically valid
Examples
Input: {'declarations': [['left', ['+', '-']], ['left', ['*', '/']], ['right', ['^']]], 'expression': '1 + 2 * 3'}
Expected Output: ('+', 1, ('*', 2, 3))
Explanation: Multiplication binds tighter than addition.
Input: {'declarations': [['left', ['+', '-']], ['left', ['*', '/']], ['right', ['^']]], 'expression': '2 ^ 3 ^ 2'}
Expected Output: ('^', 2, ('^', 3, 2))
Explanation: Right associativity makes exponentiation nest to the right.
Input: {'declarations': [['right', ['+']]], 'expression': 'a+b+c'}
Expected Output: ('+', 'a', ('+', 'b', 'c'))
Explanation: The declaration forces + to associate to the right.
Input: {'declarations': [['left', ['+', '-']], ['left', ['*', '/']]], 'expression': '(1+2)*3'}
Expected Output: ('*', ('+', 1, 2), 3)
Explanation: Parentheses override precedence.
Input: {'declarations': [['left', ['+']]], 'expression': '42'}
Expected Output: 42
Explanation: Edge case: a single operand needs no operator node.
Solution
def solution(data):
    declarations = data['declarations']
    expr = data['expression']

    # Later declarations bind tighter: the level rises by one per
    # declaration line, mirroring %left/%right ordering.
    precedence = {}
    assoc = {}
    level = 1
    operators = []
    for item in declarations:
        a = item[0]
        ops = item[1]
        for op in ops:
            precedence[op] = level
            assoc[op] = a
            operators.append(op)
        level += 1
    # Longest match first so multi-character operators lex correctly.
    operators.sort(key=lambda x: (-len(x), x))

    def lex(text):
        tokens = []
        i = 0
        while i < len(text):
            c = text[i]
            if c.isspace():
                i += 1
            elif c.isdigit():
                j = i
                while j < len(text) and text[j].isdigit():
                    j += 1
                tokens.append(('NUMBER', int(text[i:j])))
                i = j
            elif c.isalpha() or c == '_':
                j = i
                while j < len(text) and (text[j].isalnum() or text[j] == '_'):
                    j += 1
                tokens.append(('IDENT', text[i:j]))
                i = j
            elif c == '(':
                tokens.append(('LP', '('))
                i += 1
            elif c == ')':
                tokens.append(('RP', ')'))
                i += 1
            else:
                matched = None
                for op in operators:
                    if text.startswith(op, i):
                        matched = op
                        break
                if matched is None:
                    raise ValueError('invalid token')
                tokens.append(('OP', matched))
                i += len(matched)
        tokens.append(('EOF', 'EOF'))
        return tokens

    tokens = lex(expr)
    i = 0

    def peek():
        return tokens[i]

    def eat():
        nonlocal i
        tok = tokens[i]
        i += 1
        return tok

    def parse_primary():
        tok = peek()
        if tok[0] == 'NUMBER':
            eat()
            return tok[1]
        if tok[0] == 'IDENT':
            eat()
            return tok[1]
        if tok[0] == 'LP':
            eat()
            node = parse_expr(1)
            if peek()[0] != 'RP':
                raise ValueError('missing )')
            eat()
            return node
        raise ValueError('expected operand')

    # Precedence climbing: left associativity recurses at prec + 1,
    # right associativity recurses at the same prec.
    def parse_expr(min_prec):
        left = parse_primary()
        while True:
            tok = peek()
            if tok[0] != 'OP':
                break
            op = tok[1]
            prec = precedence[op]
            if prec < min_prec:
                break
            eat()
            next_min = prec + 1 if assoc[op] == 'left' else prec
            right = parse_expr(next_min)
            left = (op, left, right)
        return left

    ast = parse_expr(1)
    if peek()[0] != 'EOF':
        raise ValueError('extra input')
    return ast

Time complexity: O(n * k) in the lexer plus O(n) in the parser, where n is expression length and k is the number of distinct operators. Space complexity: O(n).
Hints
- A precedence-climbing parser is a good fit here.
- For left associativity, recurse with precedence + 1; for right associativity, recurse with the same precedence.
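To see the second hint in action, assuming the Part 7 solution above is in scope: with right-associative ^, the recursive call keeps min_prec equal to prec, so a later ^ at the same level is consumed inside the recursion and the tree nests to the right.

data = {'declarations': [['left', ['+', '-']], ['left', ['*', '/']], ['right', ['^']]],
        'expression': '2 ^ 3 ^ 2'}
assert solution(data) == ('^', 2, ('^', 3, 2))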
Part 8: Choose the Best Parser Under Time and Memory Cost Models
Constraints
- Candidate names are drawn from LL1, LR1, Earley, and PEG
- 0 <= number of workloads <= 1000
- All numeric inputs are non-negative integers
Examples
Input: {'candidates': ['LL1', 'LR1', 'Earley', 'PEG'], 'n': 3, 'p': 5, 'l': 12, 'm': 6, 'token_counts': [10, 20, 5], 'memory_budget': 100, 'ambiguous': False}
Expected Output: {'parser': 'LL1', 'time': 89, 'space': 18}
Explanation: LL1 has the lowest feasible total cost.
Input: {'candidates': ['Earley', 'PEG'], 'n': 2, 'p': 2, 'l': 3, 'm': 2, 'token_counts': [1], 'memory_budget': 10, 'ambiguous': False}
Expected Output: {'parser': 'Earley', 'time': 4, 'space': 1}
Explanation: For this tiny workload, Earley is cheaper than PEG.
Input: {'candidates': ['LL1', 'PEG'], 'n': 4, 'p': 4, 'l': 10, 'm': 5, 'token_counts': [20], 'memory_budget': 15, 'ambiguous': False}
Expected Output: {'parser': None, 'time': None, 'space': None}
Explanation: Edge case: no candidate fits within the memory budget.
Input: {'candidates': ['LL1', 'LR1'], 'n': 2, 'p': 3, 'l': 5, 'm': 4, 'token_counts': [], 'memory_budget': 20, 'ambiguous': False}
Expected Output: {'parser': 'LL1', 'time': 18, 'space': 8}
Explanation: With no programs to parse, only build cost matters.
Solution
def solution(data):
    candidates = data['candidates']
    n = data['n']
    p = data['p']
    l = data['l']
    m = data['m']
    token_counts = data['token_counts']
    memory_budget = data['memory_budget']
    ambiguous = data['ambiguous']
    max_t = max(token_counts) if token_counts else 0
    best = None
    for name in candidates:
        if name == 'LL1':
            total_time = n * l + n * m + sum(token_counts)
            space = n * m
        elif name == 'LR1':
            total_time = p * l + n * m + sum(token_counts)
            space = p * m + n
        elif name == 'Earley':
            parse_cost = 0
            for t in token_counts:
                parse_cost += t ** 3 if ambiguous else t ** 2
            total_time = l + parse_cost
            space = max_t ** 2
        elif name == 'PEG':
            total_time = l + sum(l * t for t in token_counts)
            space = l + max_t
        else:
            continue
        if space > memory_budget:
            continue
        # Tuple comparison breaks time ties on space, then on name.
        candidate = (total_time, space, name)
        if best is None or candidate < best:
            best = candidate
    if best is None:
        return {'parser': None, 'time': None, 'space': None}
    return {'parser': best[2], 'time': best[0], 'space': best[1]}

Time complexity: O(C + W), where C is the number of candidates and W is the number of workloads. Space complexity: O(1).
Hints
- Build cost is paid once, but parse cost is paid for every token count in token_counts.
- For Earley and PEG, working memory depends on the largest input, not the sum.
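Working through the first example by hand under this cost model: LL1 pays n * l + n * m to build, plus one time unit per parsed token.

n, l, m = 3, 12, 6
token_counts = [10, 20, 5]
time = n * l + n * m + sum(token_counts)   # 36 + 18 + 35 = 89
space = n * m                              # 18
assert (time, space) == (89, 18)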
Part 9: Stream and Parse Statements Incrementally from Input Chunks
Constraints
- 0 <= total input length <= 100000
- Completed statements in tests are syntactically valid
- The remainder may contain an incomplete trailing statement
Examples
Input: ['x = 1 +', ' 2;', 'y = x * 3;']
Expected Output: {'statements': [{'type': 'Assign', 'name': 'x', 'value': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 1}, 'right': {'type': 'Number', 'value': 2}}}, {'type': 'Assign', 'name': 'y', 'value': {'type': 'BinOp', 'op': '*', 'left': {'type': 'Identifier', 'name': 'x'}, 'right': {'type': 'Number', 'value': 3}}}], 'remainder': ''}
Explanation: The first statement is completed across chunk boundaries.
Input: ['(1+2', ')*3;', 'z=4']
Expected Output: {'statements': [{'type': 'ExprStmt', 'expr': {'type': 'BinOp', 'op': '*', 'left': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 1}, 'right': {'type': 'Number', 'value': 2}}, 'right': {'type': 'Number', 'value': 3}}}], 'remainder': 'z=4'}
Explanation: Only the first top-level semicolon-terminated statement is complete.
Input: []
Expected Output: {'statements': [], 'remainder': ''}
Explanation: Edge case: no input chunks.
Input: ['x=1;', 'y=2', '+3;']
Expected Output: {'statements': [{'type': 'Assign', 'name': 'x', 'value': {'type': 'Number', 'value': 1}}, {'type': 'Assign', 'name': 'y', 'value': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 2}, 'right': {'type': 'Number', 'value': 3}}}], 'remainder': ''}
Explanation: The second assignment is incomplete until the final chunk arrives.
Solution
def solution(chunks):
    def lex(text):
        tokens = []
        i = 0
        while i < len(text):
            c = text[i]
            if c.isspace():
                i += 1
            elif c.isdigit():
                j = i
                while j < len(text) and text[j].isdigit():
                    j += 1
                tokens.append(('NUMBER', int(text[i:j])))
                i = j
            elif c.isalpha() or c == '_':
                j = i
                while j < len(text) and (text[j].isalnum() or text[j] == '_'):
                    j += 1
                tokens.append(('IDENT', text[i:j]))
                i = j
            elif c in '+-*/=();':
                tokens.append((c, c))
                i += 1
            else:
                raise ValueError('invalid character')
        tokens.append(('EOF', 'EOF'))
        return tokens

    def parse_statement_text(text):
        tokens = lex(text)
        i = 0

        def peek(k=0):
            return tokens[i + k][0]

        def eat(expected=None):
            nonlocal i
            tok = tokens[i]
            if expected is not None and tok[0] != expected:
                raise ValueError('syntax error')
            i += 1
            return tok

        def parse_factor():
            if peek() == 'NUMBER':
                return {'type': 'Number', 'value': eat('NUMBER')[1]}
            if peek() == 'IDENT':
                return {'type': 'Identifier', 'name': eat('IDENT')[1]}
            if peek() == '(':
                eat('(')
                node = parse_expr()
                eat(')')
                return node
            raise ValueError('expected factor')

        def parse_term():
            node = parse_factor()
            while peek() in ('*', '/'):
                op = eat()[0]
                right = parse_factor()
                node = {'type': 'BinOp', 'op': op, 'left': node, 'right': right}
            return node

        def parse_expr():
            node = parse_term()
            while peek() in ('+', '-'):
                op = eat()[0]
                right = parse_term()
                node = {'type': 'BinOp', 'op': op, 'left': node, 'right': right}
            return node

        if peek() == 'IDENT' and peek(1) == '=':
            name = eat('IDENT')[1]
            eat('=')
            value = parse_expr()
            eat(';')
            eat('EOF')
            return {'type': 'Assign', 'name': name, 'value': value}
        expr = parse_expr()
        eat(';')
        eat('EOF')
        return {'type': 'ExprStmt', 'expr': expr}

    # Accumulate characters across chunks; a statement is complete at
    # the first ';' that appears at parenthesis depth 0.
    statements = []
    current = []
    depth = 0
    for chunk in chunks:
        for ch in chunk:
            current.append(ch)
            if ch == '(':
                depth += 1
            elif ch == ')':
                if depth > 0:
                    depth -= 1
            elif ch == ';' and depth == 0:
                text = ''.join(current).strip()
                if text:
                    statements.append(parse_statement_text(text))
                current = []
    return {'statements': statements, 'remainder': ''.join(current)}

Time complexity: O(n), where n is the total number of characters across all chunks. Space complexity: O(n).
Hints
- Track parenthesis depth while reading the stream.
- You only need to parse a statement once its terminating semicolon appears at depth 0.
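A quick check of the depth-tracking rule, assuming the Part 9 solution above is in scope: the ';' at parenthesis depth 0 completes the first statement, and the trailing 'z=4' stays in the remainder.

out = solution(['(1+2', ')*3;', 'z=4'])
assert len(out['statements']) == 1
assert out['statements'][0]['type'] == 'ExprStmt'
assert out['remainder'] == 'z=4'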