How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a Medium difficulty Coding & Algorithms question, commonly asked during Onsite rounds at OpenAI.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at OpenAI during technical interviews.

Design a parser for a hypothetical language

Company: OpenAI

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Onsite

You are given a set of syntax rules for a hypothetical programming language written in BNF/EBNF. Design data structures and algorithms that: ( 1) parse the grammar into an internal representation (terminals, nonterminals, productions, precedence/associativity); ( 2) detect issues such as left recursion, ambiguity, unreachable or non-productive symbols; ( 3) apply transformations (left-recursion elimination, left factoring) to enable efficient parsing; ( 4) compute FIRST/FOLLOW sets and determine whether the grammar is LL ( 1); if not LL ( 1), justify choosing LR ( 1), GLR, Earley, or PEG parsing and outline construction; ( 5) implement a lexer and parser that build an abstract syntax tree (AST) for input programs; ( 6) provide error reporting and recovery with expected-token hints; ( 7) support operator precedence and associativity declarations; ( 8) analyze time and space complexity for grammar processing and parsing, and discuss trade-offs; ( 9) optionally support incremental or streaming parsing. Provide pseudocode and explain your design choices.

Quick Answer: This question evaluates competency in compiler and language design, covering grammar analysis, parsing algorithms (LL/LR/GLR/Earley/PEG), lexer and parser implementation, AST construction, error reporting and recovery, operator precedence and associativity, and analysis of time/space trade-offs.

Part 1: Parse a Grammar Specification into an Internal Representation

You are given a list of grammar specification lines. Each line is either a precedence declaration such as %left + - or a production such as Expr -> Expr + Term | Term. Parse the specification into an internal representation containing the set of nonterminals, the set of terminals, the flattened production list, and the precedence/associativity table. Symbols are separated by spaces. Epsilon is written as ε. Any symbol that appears on a left-hand side is a nonterminal; every other non-ε symbol appearing on a right-hand side is a terminal.

Constraints

0 <= number of lines <= 200
Each production contains exactly one ->
Symbols are whitespace-separated and epsilon is represented by 'ε'

Examples

Expected Output: {'nonterminals': ['Expr', 'Factor', 'Term'], 'terminals': ['(', ')', '*', '+', '-', '/', 'num'], 'productions': [{'lhs': 'Expr', 'rhs': ['Expr', '+', 'Term']}, {'lhs': 'Expr', 'rhs': ['Expr', '-', 'Term']}, {'lhs': 'Expr', 'rhs': ['Term']}, {'lhs': 'Term', 'rhs': ['Term', '*', 'Factor']}, {'lhs': 'Term', 'rhs': ['Term', '/', 'Factor']}, {'lhs': 'Term', 'rhs': ['Factor']}, {'lhs': 'Factor', 'rhs': ['num']}, {'lhs': 'Factor', 'rhs': ['(', 'Expr', ')']}], 'precedence': {'+': {'assoc': 'left', 'level': 1}, '-': {'assoc': 'left', 'level': 1}, '*': {'assoc': 'left', 'level': 2}, '/': {'assoc': 'left', 'level': 2}}}

Explanation: This includes both directives and production rules.

Input: []

Expected Output: {'nonterminals': [], 'terminals': [], 'productions': [], 'precedence': {}}

Explanation: Edge case: empty specification.

Input: ['S -> ε']

Expected Output: {'nonterminals': ['S'], 'terminals': [], 'productions': [{'lhs': 'S', 'rhs': ['ε']}], 'precedence': {}}

Explanation: A grammar with only epsilon has no terminals.

Input: ['%right ^', 'Atom -> id | num', 'Pow -> Atom ^ Pow | Atom']

Expected Output: {'nonterminals': ['Atom', 'Pow'], 'terminals': ['^', 'id', 'num'], 'productions': [{'lhs': 'Atom', 'rhs': ['id']}, {'lhs': 'Atom', 'rhs': ['num']}, {'lhs': 'Pow', 'rhs': ['Atom', '^', 'Pow']}, {'lhs': 'Pow', 'rhs': ['Atom']}], 'precedence': {'^': {'assoc': 'right', 'level': 1}}}

Explanation: A single precedence declaration assigns level 1.

Solution

def solution(lines):
    precedence = {}
    parsed_rules = []
    nonterminals = set()
    level = 0

    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith('%'):
            parts = line.split()
            assoc = parts[0][1:]
            level += 1
            for tok in parts[1:]:
                precedence[tok] = {'assoc': assoc, 'level': level}
        else:
            if '->' not in line:
                continue
            lhs, rhs = line.split('->', 1)
            lhs = lhs.strip()
            nonterminals.add(lhs)
            parsed_rules.append((lhs, rhs.strip()))

    productions = []
    for lhs, rhs_text in parsed_rules:
        for alt in rhs_text.split('|'):
            alt = alt.strip()
            rhs = alt.split() if alt else ['ε']
            productions.append({'lhs': lhs, 'rhs': rhs or ['ε']})

    terminals = set()
    for prod in productions:
        for sym in prod['rhs']:
            if sym != 'ε' and sym not in nonterminals:
                terminals.add(sym)

    return {
        'nonterminals': sorted(nonterminals),
        'terminals': sorted(terminals),
        'productions': productions,
        'precedence': precedence
    }

Time complexity: O(L), where L is the total number of symbols/tokens across all lines. Space complexity: O(L).

Hints

First collect every left-hand side before classifying terminals.
Flatten alternatives after splitting each right-hand side on |.

Part 2: Detect Left Recursion, Unreachable Symbols, Non-Productive Symbols, and FIRST-Set Conflicts

Given a context-free grammar, detect four classes of issues: left-recursive nonterminals, unreachable nonterminals, non-productive nonterminals, and ambiguous nonterminals under a simplified heuristic. The ambiguity heuristic for this problem is: a nonterminal is flagged if two of its productions have overlapping FIRST sets. Epsilon is written as ε.

Constraints

1 <= number of nonterminals <= 100
Each production is a list of strings
The start symbol always exists in the grammar

Examples

Input: ('S', {'S': [['S', 'a'], ['A'], ['b']], 'A': [['ε'], ['c']], 'B': [['d']]})

Expected Output: {'left_recursive': ['S'], 'unreachable': ['B'], 'non_productive': [], 'ambiguous_nonterminals': ['S']}

Explanation: S is directly left-recursive through S -> S a. B is never reached from the start symbol. All nonterminals are productive. S is ambiguous under the heuristic because FIRST(S a) overlaps with FIRST(A) and FIRST(b).

Input: ('S', {'S': [['A', 'm']], 'A': [['B']], 'B': [['S'], ['t'], ['t', 'u']], 'C': [['D']], 'D': [['C']]})

Expected Output: {'left_recursive': ['A', 'B', 'S'], 'unreachable': ['C', 'D'], 'non_productive': ['C', 'D'], 'ambiguous_nonterminals': ['B']}

Explanation: S, A, and B form a productive left-recursive cycle. C and D form a dead cycle, so they are unreachable and non-productive, but they should not be reported as left-recursive for this problem. B is ambiguous because FIRST(S), FIRST(t), and FIRST(t u) all contain 't'.

Input: ('S', {'S': [['A']], 'A': [['B']], 'B': [['C']], 'C': [['C']]})

Expected Output: {'left_recursive': [], 'unreachable': [], 'non_productive': ['A', 'B', 'C', 'S'], 'ambiguous_nonterminals': []}

Explanation: Every nonterminal is reachable from S, but none can derive any terminal string. The self-cycle at C is a dead cycle, so no nonterminal is reported as left-recursive under the productive-left-recursion rule.

Input: ('S', {'S': [['ε']]})

Expected Output: {'left_recursive': [], 'unreachable': [], 'non_productive': [], 'ambiguous_nonterminals': []}

Explanation: The grammar contains a single productive epsilon production. There is no recursion, no unreachable symbol, and only one production, so there is no FIRST-set conflict.

Solution

def solution(start_symbol, grammar):
    EPS = 'ε'
    nonterminals = set(grammar.keys())

    def seq_is_nullable(prod, nullable):
        if not prod:
            return True
        for sym in prod:
            if sym == EPS:
                continue
            if sym in nonterminals and sym in nullable:
                continue
            return False
        return True

    def seq_is_productive(prod, productive):
        if not prod:
            return True
        for sym in prod:
            if sym == EPS:
                continue
            if sym in nonterminals:
                if sym not in productive:
                    return False
            else:
                continue
        return True

    nullable = set()
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            if nt in nullable:
                continue
            for prod in productions:
                if seq_is_nullable(prod, nullable):
                    nullable.add(nt)
                    changed = True
                    break

    productive = set()
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            if nt in productive:
                continue
            for prod in productions:
                if seq_is_productive(prod, productive):
                    productive.add(nt)
                    changed = True
                    break

    first = {nt: set() for nt in nonterminals}

    def first_of_sequence(prod):
        if not prod:
            return {EPS}
        result = set()
        all_nullable = True
        for sym in prod:
            if sym == EPS:
                continue
            if sym in nonterminals:
                result |= (first[sym] - {EPS})
                if sym in nullable:
                    continue
                all_nullable = False
                break
            else:
                result.add(sym)
                all_nullable = False
                break
        if all_nullable:
            result.add(EPS)
        return result

    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for prod in productions:
                seq_first = first_of_sequence(prod)
                if not seq_first.issubset(first[nt]):
                    first[nt] |= seq_first
                    changed = True

    reachable = set()
    stack = [start_symbol] if start_symbol in nonterminals else []
    while stack:
        nt = stack.pop()
        if nt in reachable:
            continue
        reachable.add(nt)
        for prod in grammar.get(nt, []):
            for sym in prod:
                if sym in nonterminals and sym not in reachable:
                    stack.append(sym)

    unreachable = sorted(nonterminals - reachable)
    non_productive = sorted(nonterminals - productive)

    ambiguous = []
    for nt, productions in grammar.items():
        seen = set()
        conflict = False
        for prod in productions:
            prod_first = first_of_sequence(prod)
            if prod_first & seen:
                conflict = True
                break
            seen |= prod_first
        if conflict:
            ambiguous.append(nt)
    ambiguous.sort()

    graph = {nt: set() for nt in productive}
    for nt in productive:
        for prod in grammar[nt]:
            if not prod:
                continue
            for sym in prod:
                if sym == EPS:
                    continue
                if sym in nonterminals:
                    if sym in productive:
                        graph[nt].add(sym)
                    if sym in nullable:
                        continue
                    break
                else:
                    break

    def is_left_recursive(start):
        stack = list(graph.get(start, ()))
        visited = set()
        while stack:
            cur = stack.pop()
            if cur == start:
                return True
            if cur in visited:
                continue
            visited.add(cur)
            stack.extend(graph.get(cur, ()))
        return False

    left_recursive = sorted(nt for nt in productive if is_left_recursive(nt))

    return {
        'left_recursive': left_recursive,
        'unreachable': unreachable,
        'non_productive': non_productive,
        'ambiguous_nonterminals': ambiguous
    }

Time complexity: O(N^3) worst-case, where N is the total size of the grammar representation. Space complexity: O(N^2) worst-case.

Hints

Unreachable symbols come from a graph traversal starting at the start symbol.
A productive nonterminal is one that can eventually derive only terminals and/or ε.

Part 3: Transform a Grammar by Eliminating Direct Left Recursion and Left-Factoring

Given a grammar, transform it in two steps: first eliminate direct left recursion for every nonterminal, then repeatedly left-factor any set of productions that share the same first symbol. For this problem, left factoring is only by the first symbol, repeated until no such conflict remains. Use A_rec for the helper created by left-recursion elimination and F1, F2, ... for factoring helpers.

Constraints

1 <= number of nonterminals <= 50
Only direct left recursion must be eliminated
Factoring is repeated only on common first symbols

Examples

Input: {'A': [['A', 'a'], ['A', 'b'], ['c'], ['d']]}

Expected Output: {'A': [['c', 'A_rec'], ['d', 'A_rec']], 'A_rec': [['a', 'A_rec'], ['b', 'A_rec'], ['ε']]}

Explanation: Direct left recursion is eliminated in the standard way.

Input: {'S': [['a', 'b', 'c'], ['a', 'b', 'd'], ['a', 'e']]}

Expected Output: {'S': [['a', 'F1']], 'F1': [['b', 'F2'], ['e']], 'F2': [['c'], ['d']]}

Explanation: Factoring happens twice: first on a, then inside F1 on b.

Input: {'S': [['S', 'a'], ['b'], ['b', 'c']]}

Expected Output: {'S': [['b', 'F1']], 'S_rec': [['a', 'S_rec'], ['ε']], 'F1': [['S_rec'], ['c', 'S_rec']]}

Explanation: First remove direct left recursion, then factor the two b-starting productions.

Input: {'S': [['a'], ['b']]}

Expected Output: {'S': [['a'], ['b']]}

Explanation: Edge case: no transformation is needed.

Solution

def solution(grammar):
    g = {k: [prod[:] for prod in v] for k, v in grammar.items()}
    transformed = {}

    for nt in sorted(g):
        prods = g[nt]
        recursive = [prod[1:] for prod in prods if prod and prod[0] == nt]
        non_recursive = [prod[:] for prod in prods if not (prod and prod[0] == nt)]
        if recursive:
            helper = nt + '_rec'
            betas = non_recursive if non_recursive else [['ε']]
            transformed[nt] = []
            for beta in betas:
                seq = [] if beta == ['ε'] else beta[:]
                seq.append(helper)
                transformed[nt].append(seq)
            transformed[helper] = []
            for alpha in recursive:
                seq = [] if alpha == ['ε'] else alpha[:]
                seq.append(helper)
                transformed[helper].append(seq)
            transformed[helper].append(['ε'])
        else:
            transformed[nt] = [prod[:] for prod in prods]

    counter = 1
    while True:
        changed = False
        for nt in sorted(list(transformed.keys())):
            prods = transformed[nt]
            groups = {}
            for idx, prod in enumerate(prods):
                key = None if prod == ['ε'] or not prod else prod[0]
                groups.setdefault(key, []).append((idx, prod))

            factorable = sorted(sym for sym, items in groups.items() if sym is not None and len(items) > 1)
            if not factorable:
                continue

            sym = factorable[0]
            helper = 'F' + str(counter)
            counter += 1
            used = {idx for idx, _ in groups[sym]}
            new_nt_prods = []
            helper_prods = []
            inserted = False

            for idx, prod in enumerate(prods):
                if idx in used:
                    if not inserted:
                        new_nt_prods.append([sym, helper])
                        inserted = True
                    rest = prod[1:]
                    helper_prods.append(rest if rest else ['ε'])
                else:
                    new_nt_prods.append(prod)

            transformed[nt] = new_nt_prods
            transformed[helper] = helper_prods
            changed = True
            break
        if not changed:
            break

    return transformed

Time complexity: O(R^2 + L) in the worst case, where R is the number of productions and L is the total RHS length. Space complexity: O(R + L).

Hints

Split A productions into recursive ones of the form A -> A α and non-recursive ones A -> β.
After creating one factoring helper, restart the scan because the new grammar may contain another factorable nonterminal.

Part 4: Compute FIRST and FOLLOW Sets, Check LL(1), and Choose an Alternative Parser

Given a grammar and a start symbol, compute FIRST and FOLLOW sets for every nonterminal, determine whether the grammar is LL(1), and return a deterministic parser choice. Use this choice rule: return LL(1) if the grammar is LL(1); otherwise choose LR(1) if left recursion exists, PEG if there is a FIRST/FIRST conflict without left recursion, and Earley otherwise. Epsilon is written as ε and end-of-input is $.

Constraints

1 <= number of nonterminals <= 100
Epsilon is represented by 'ε'
All productions are lists of symbols

Examples

Input: {'start': 'E', 'grammar': {'E': [['T', 'Ep']], 'Ep': [['+', 'T', 'Ep'], ['ε']], 'T': [['id'], ['(', 'E', ')']]}}

Expected Output: {'first': {'E': ['(', 'id'], 'Ep': ['+', 'ε'], 'T': ['(', 'id']}, 'follow': {'E': ['$', ')'], 'Ep': ['$', ')'], 'T': ['$', ')', '+']}, 'is_ll1': True, 'alternative': 'LL(1)', 'reason': 'No LL(1) conflicts'}

Explanation: This is a standard LL(1) expression grammar.

Input: {'start': 'E', 'grammar': {'E': [['E', '+', 'T'], ['T']], 'T': [['id']]}}

Expected Output: {'first': {'E': ['id'], 'T': ['id']}, 'follow': {'E': ['$', '+'], 'T': ['$', '+']}, 'is_ll1': False, 'alternative': 'LR(1)', 'reason': 'Left recursion detected'}

Explanation: The grammar is not LL(1) and left recursion suggests an LR-style parser.

Input: {'start': 'S', 'grammar': {'S': [['a', 'A'], ['a', 'B']], 'A': [['c']], 'B': [['d']]}}

Expected Output: {'first': {'A': ['c'], 'B': ['d'], 'S': ['a']}, 'follow': {'A': ['$'], 'B': ['$'], 'S': ['$']}, 'is_ll1': False, 'alternative': 'PEG', 'reason': 'Productions share a prefix token'}

Explanation: There is a FIRST/FIRST conflict but no left recursion.

Input: {'start': 'S', 'grammar': {'S': [['A', 'b']], 'A': [['ε'], ['b']]}}

Expected Output: {'first': {'A': ['b', 'ε'], 'S': ['b']}, 'follow': {'A': ['b'], 'S': ['$']}, 'is_ll1': False, 'alternative': 'Earley', 'reason': 'Nullable/FOLLOW conflict requires a more general parser'}

Explanation: The conflict arises from epsilon and FOLLOW, not from a shared non-epsilon FIRST token.

Solution

def solution(data):
    grammar = {k: [prod[:] for prod in v] for k, v in data['grammar'].items()}
    start = data['start']
    nonterminals = set(grammar)

    first = {nt: set() for nt in nonterminals}
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                if prod == ['ε'] or not prod:
                    if 'ε' not in first[nt]:
                        first[nt].add('ε')
                        changed = True
                    continue
                all_nullable = True
                for sym in prod:
                    if sym == 'ε':
                        if 'ε' not in first[nt]:
                            first[nt].add('ε')
                            changed = True
                        break
                    if sym not in nonterminals:
                        if sym not in first[nt]:
                            first[nt].add(sym)
                            changed = True
                        all_nullable = False
                        break
                    before = len(first[nt])
                    first[nt] |= (first[sym] - {'ε'})
                    if len(first[nt]) != before:
                        changed = True
                    if 'ε' in first[sym]:
                        continue
                    all_nullable = False
                    break
                else:
                    if all_nullable and 'ε' not in first[nt]:
                        first[nt].add('ε')
                        changed = True

    def first_of_seq(seq):
        if seq == ['ε'] or not seq:
            return {'ε'}
        out = set()
        all_nullable = True
        for sym in seq:
            if sym == 'ε':
                out.add('ε')
                return out
            if sym not in nonterminals:
                out.add(sym)
                return out
            out |= (first[sym] - {'ε'})
            if 'ε' in first[sym]:
                continue
            all_nullable = False
            break
        else:
            all_nullable = True
        if all_nullable:
            out.add('ε')
        return out

    follow = {nt: set() for nt in nonterminals}
    follow[start].add('$')
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                for i, sym in enumerate(prod):
                    if sym not in nonterminals:
                        continue
                    beta = prod[i + 1:]
                    fb = first_of_seq(beta)
                    before = len(follow[sym])
                    follow[sym] |= (fb - {'ε'})
                    if 'ε' in fb or not beta:
                        follow[sym] |= follow[nt]
                    if len(follow[sym]) != before:
                        changed = True

    nullable = {nt for nt in nonterminals if 'ε' in first[nt]}
    graph = {nt: set() for nt in nonterminals}
    for nt, prods in grammar.items():
        for prod in prods:
            prefix_nullable = True
            for sym in prod:
                if sym == 'ε':
                    continue
                if sym in nonterminals:
                    if prefix_nullable:
                        graph[nt].add(sym)
                    if sym in nullable:
                        continue
                    break
                break

    left_recursive = False
    for nt in nonterminals:
        stack = list(graph[nt])
        seen = set()
        while stack:
            cur = stack.pop()
            if cur == nt:
                left_recursive = True
                stack = []
                break
            if cur in seen:
                continue
            seen.add(cur)
            stack.extend(graph[cur] - seen)
        if left_recursive:
            break

    is_ll1 = True
    first_first_conflict = False
    for nt, prods in grammar.items():
        prod_firsts = [first_of_seq(prod) for prod in prods]
        for i in range(len(prod_firsts)):
            for j in range(i + 1, len(prod_firsts)):
                if (prod_firsts[i] - {'ε'}) & (prod_firsts[j] - {'ε'}):
                    is_ll1 = False
                    first_first_conflict = True
                if 'ε' in prod_firsts[i] and 'ε' in prod_firsts[j]:
                    is_ll1 = False
                if 'ε' in prod_firsts[i] and (follow[nt] & (prod_firsts[j] - {'ε'})):
                    is_ll1 = False
                if 'ε' in prod_firsts[j] and (follow[nt] & (prod_firsts[i] - {'ε'})):
                    is_ll1 = False

    if is_ll1:
        alternative = 'LL(1)'
        reason = 'No LL(1) conflicts'
    elif left_recursive:
        alternative = 'LR(1)'
        reason = 'Left recursion detected'
    elif first_first_conflict:
        alternative = 'PEG'
        reason = 'Productions share a prefix token'
    else:
        alternative = 'Earley'
        reason = 'Nullable/FOLLOW conflict requires a more general parser'

    return {
        'first': {nt: sorted(first[nt]) for nt in sorted(grammar)},
        'follow': {nt: sorted(follow[nt]) for nt in sorted(grammar)},
        'is_ll1': is_ll1,
        'alternative': alternative,
        'reason': reason
    }

Time complexity: O(N * L + N^2), where N is the number of nonterminals and L is the total RHS length. Space complexity: O(N + L).

Hints

For LL(1), check both FIRST/FIRST conflicts and epsilon-related FIRST/FOLLOW conflicts.
FOLLOW of the start symbol always contains $.

Part 5: Build an AST with a Lexer and Recursive-Descent Parser

Implement a lexer and parser for a tiny language of arithmetic expressions and assignments. The grammar is: program -> stmt*; stmt -> IDENT = expr ; | expr ; ; expr -> term ((+|-) term)* ; term -> factor ((*|/) factor)* ; factor -> NUMBER | IDENT | ( expr ). Return the abstract syntax tree.

Constraints

0 <= input length <= 100000
Numbers are non-negative integers
Input programs in tests are syntactically valid

Examples

Input: 'x = 1 + 2 * 3; y = (x + 4);'

Expected Output: {'type': 'Program', 'body': [{'type': 'Assign', 'name': 'x', 'value': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 1}, 'right': {'type': 'BinOp', 'op': '*', 'left': {'type': 'Number', 'value': 2}, 'right': {'type': 'Number', 'value': 3}}}}, {'type': 'Assign', 'name': 'y', 'value': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Identifier', 'name': 'x'}, 'right': {'type': 'Number', 'value': 4}}}]}

Explanation: Multiplication binds tighter than addition.

Input: '1 + 2;'

Expected Output: {'type': 'Program', 'body': [{'type': 'ExprStmt', 'expr': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 1}, 'right': {'type': 'Number', 'value': 2}}}]}

Explanation: A bare expression becomes an ExprStmt.

Input: ''

Expected Output: {'type': 'Program', 'body': []}

Explanation: Edge case: empty program.

Input: 'z = (1 + 2) * (3 + 4);'

Expected Output: {'type': 'Program', 'body': [{'type': 'Assign', 'name': 'z', 'value': {'type': 'BinOp', 'op': '*', 'left': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 1}, 'right': {'type': 'Number', 'value': 2}}, 'right': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 3}, 'right': {'type': 'Number', 'value': 4}}}}]}

Explanation: Parentheses force grouping.

Solution

def solution(program):
    def lex(text):
        tokens = []
        i = 0
        while i < len(text):
            c = text[i]
            if c.isspace():
                i += 1
            elif c.isdigit():
                j = i
                while j < len(text) and text[j].isdigit():
                    j += 1
                tokens.append(('NUMBER', int(text[i:j])))
                i = j
            elif c.isalpha() or c == '_':
                j = i
                while j < len(text) and (text[j].isalnum() or text[j] == '_'):
                    j += 1
                tokens.append(('IDENT', text[i:j]))
                i = j
            elif c in '+-*/=();':
                tokens.append((c, c))
                i += 1
            else:
                raise ValueError('invalid character')
        tokens.append(('EOF', 'EOF'))
        return tokens

    tokens = lex(program)
    i = 0

    def peek(k=0):
        return tokens[i + k][0]

    def eat(expected=None):
        nonlocal i
        tok = tokens[i]
        if expected is not None and tok[0] != expected:
            raise ValueError('syntax error')
        i += 1
        return tok

    def parse_factor():
        if peek() == 'NUMBER':
            return {'type': 'Number', 'value': eat('NUMBER')[1]}
        if peek() == 'IDENT':
            return {'type': 'Identifier', 'name': eat('IDENT')[1]}
        if peek() == '(':
            eat('(')
            node = parse_expr()
            eat(')')
            return node
        raise ValueError('expected factor')

    def parse_term():
        node = parse_factor()
        while peek() in ('*', '/'):
            op = eat()[0]
            right = parse_factor()
            node = {'type': 'BinOp', 'op': op, 'left': node, 'right': right}
        return node

    def parse_expr():
        node = parse_term()
        while peek() in ('+', '-'):
            op = eat()[0]
            right = parse_term()
            node = {'type': 'BinOp', 'op': op, 'left': node, 'right': right}
        return node

    def parse_stmt():
        if peek() == 'IDENT' and peek(1) == '=':
            name = eat('IDENT')[1]
            eat('=')
            value = parse_expr()
            eat(';')
            return {'type': 'Assign', 'name': name, 'value': value}
        expr = parse_expr()
        eat(';')
        return {'type': 'ExprStmt', 'expr': expr}

    body = []
    while peek() != 'EOF':
        body.append(parse_stmt())
    return {'type': 'Program', 'body': body}

Time complexity: O(n), where n is the length of the input. Space complexity: O(n).

Hints

Tokenize first so the parser only works with token kinds, not raw characters.
Use one function per precedence level: expr, term, factor.

Part 6: Parse Statements with Error Reporting and Semicolon-Based Recovery

Implement a small parser with error recovery. The language contains two statement forms: let IDENT = NUMBER ; and print IDENT ;. When a syntax error occurs, report the character position, the expected token names, and the found token text. Then recover by skipping tokens until the next semicolon and continue parsing the remaining program.

Constraints

0 <= input length <= 100000
Identifiers use letters, digits, and underscores and do not start with a digit
Recovery skips to the next semicolon or EOF

Examples

Input: 'let x = 10; print x;'

Expected Output: {'valid_statements': 2, 'errors': []}

Explanation: Both statements are valid.

Input: 'let = 10; print x;'

Expected Output: {'valid_statements': 1, 'errors': [{'position': 4, 'expected': ['IDENT'], 'found': '='}]}

Explanation: The parser recovers after the bad let statement and still parses print x;.

Input: 'print ; let y = 2;'

Expected Output: {'valid_statements': 1, 'errors': [{'position': 6, 'expected': ['IDENT'], 'found': ';'}]}

Explanation: print must be followed by an identifier.

Input: ''

Expected Output: {'valid_statements': 0, 'errors': []}

Explanation: Edge case: empty program.

Input: 'let x = 1'

Expected Output: {'valid_statements': 0, 'errors': [{'position': 9, 'expected': [';'], 'found': 'EOF'}]}

Explanation: Missing semicolon at end of file.

Solution

def solution(program):
    def lex(text):
        tokens = []
        i = 0
        while i < len(text):
            c = text[i]
            if c.isspace():
                i += 1
            elif c.isalpha() or c == '_':
                j = i
                while j < len(text) and (text[j].isalnum() or text[j] == '_'):
                    j += 1
                word = text[i:j]
                typ = 'IDENT'
                if word == 'let':
                    typ = 'LET'
                elif word == 'print':
                    typ = 'PRINT'
                tokens.append({'type': typ, 'value': word, 'pos': i})
                i = j
            elif c.isdigit():
                j = i
                while j < len(text) and text[j].isdigit():
                    j += 1
                tokens.append({'type': 'NUMBER', 'value': text[i:j], 'pos': i})
                i = j
            elif c == '=':
                tokens.append({'type': 'EQ', 'value': '=', 'pos': i})
                i += 1
            elif c == ';':
                tokens.append({'type': 'SEMI', 'value': ';', 'pos': i})
                i += 1
            else:
                tokens.append({'type': 'INVALID', 'value': c, 'pos': i})
                i += 1
        tokens.append({'type': 'EOF', 'value': 'EOF', 'pos': len(text)})
        return tokens

    tokens = lex(program)
    i = 0
    errors = []
    valid = 0

    def cur():
        return tokens[i]

    def advance():
        nonlocal i
        i += 1

    def report(expected):
        token = cur()
        found = token['value'] if token['type'] != 'EOF' else 'EOF'
        errors.append({'position': token['pos'], 'expected': sorted(expected), 'found': found})

    def recover():
        nonlocal i
        while cur()['type'] not in ('SEMI', 'EOF'):
            i += 1
        if cur()['type'] == 'SEMI':
            i += 1

    while cur()['type'] != 'EOF':
        if cur()['type'] == 'LET':
            advance()
            if cur()['type'] != 'IDENT':
                report(['IDENT'])
                recover()
                continue
            advance()
            if cur()['type'] != 'EQ':
                report(['='])
                recover()
                continue
            advance()
            if cur()['type'] != 'NUMBER':
                report(['NUMBER'])
                recover()
                continue
            advance()
            if cur()['type'] != 'SEMI':
                report([';'])
                recover()
                continue
            advance()
            valid += 1
        elif cur()['type'] == 'PRINT':
            advance()
            if cur()['type'] != 'IDENT':
                report(['IDENT'])
                recover()
                continue
            advance()
            if cur()['type'] != 'SEMI':
                report([';'])
                recover()
                continue
            advance()
            valid += 1
        else:
            report(['let', 'print'])
            recover()

    return {'valid_statements': valid, 'errors': errors}

Time complexity: O(n), where n is the length of the input. Space complexity: O(n).

Hints

Keep token positions from the lexer so the parser can report exact character indices.
After the first error in a statement, do not try to continue parsing that same statement.

Part 7: Parse Expressions Using Declared Operator Precedence and Associativity

You are given operator declarations and an expression. Build an AST that respects the declared precedence and associativity. Declarations are provided from lowest precedence to highest precedence. Operands are integers, identifiers, or parenthesized subexpressions. Return the AST as nested tuples of the form (operator, left, right). Leaf nodes are integers or identifier strings.

Constraints

1 <= number of declared operators <= 20
Operators in tests do not overlap with identifiers or numbers
Expressions in tests are syntactically valid

Examples

Input: {'declarations': [['left', ['+', '-']], ['left', ['*', '/']], ['right', ['^']]], 'expression': '1 + 2 * 3'}

Expected Output: ('+', 1, ('*', 2, 3))

Explanation: Multiplication binds tighter than addition.

Input: {'declarations': [['left', ['+', '-']], ['left', ['*', '/']], ['right', ['^']]], 'expression': '2 ^ 3 ^ 2'}

Expected Output: ('^', 2, ('^', 3, 2))

Explanation: Right associativity makes exponentiation nest to the right.

Input: {'declarations': [['right', ['+']]], 'expression': 'a+b+c'}

Expected Output: ('+', 'a', ('+', 'b', 'c'))

Explanation: The declaration forces + to associate to the right.

Input: {'declarations': [['left', ['+', '-']], ['left', ['*', '/']]], 'expression': '(1+2)*3'}

Expected Output: ('*', ('+', 1, 2), 3)

Explanation: Parentheses override precedence.

Input: {'declarations': [['left', ['+']]], 'expression': '42'}

Expected Output: 42

Explanation: Edge case: a single operand needs no operator node.

Solution

def solution(data):
    declarations = data['declarations']
    expr = data['expression']

    precedence = {}
    assoc = {}
    level = 1
    operators = []
    for item in declarations:
        a = item[0]
        ops = item[1]
        for op in ops:
            precedence[op] = level
            assoc[op] = a
            operators.append(op)
        level += 1
    operators.sort(key=lambda x: (-len(x), x))

    def lex(text):
        tokens = []
        i = 0
        while i < len(text):
            c = text[i]
            if c.isspace():
                i += 1
            elif c.isdigit():
                j = i
                while j < len(text) and text[j].isdigit():
                    j += 1
                tokens.append(('NUMBER', int(text[i:j])))
                i = j
            elif c.isalpha() or c == '_':
                j = i
                while j < len(text) and (text[j].isalnum() or text[j] == '_'):
                    j += 1
                tokens.append(('IDENT', text[i:j]))
                i = j
            elif c == '(':
                tokens.append(('LP', '('))
                i += 1
            elif c == ')':
                tokens.append(('RP', ')'))
                i += 1
            else:
                matched = None
                for op in operators:
                    if text.startswith(op, i):
                        matched = op
                        break
                if matched is None:
                    raise ValueError('invalid token')
                tokens.append(('OP', matched))
                i += len(matched)
        tokens.append(('EOF', 'EOF'))
        return tokens

    tokens = lex(expr)
    i = 0

    def peek():
        return tokens[i]

    def eat():
        nonlocal i
        tok = tokens[i]
        i += 1
        return tok

    def parse_primary():
        tok = peek()
        if tok[0] == 'NUMBER':
            eat()
            return tok[1]
        if tok[0] == 'IDENT':
            eat()
            return tok[1]
        if tok[0] == 'LP':
            eat()
            node = parse_expr(1)
            if peek()[0] != 'RP':
                raise ValueError('missing )')
            eat()
            return node
        raise ValueError('expected operand')

    def parse_expr(min_prec):
        left = parse_primary()
        while True:
            tok = peek()
            if tok[0] != 'OP':
                break
            op = tok[1]
            prec = precedence[op]
            if prec < min_prec:
                break
            eat()
            next_min = prec + 1 if assoc[op] == 'left' else prec
            right = parse_expr(next_min)
            left = (op, left, right)
        return left

    ast = parse_expr(1)
    if peek()[0] != 'EOF':
        raise ValueError('extra input')
    return ast

Time complexity: O(n * k) in the lexer plus O(n) in the parser, where n is expression length and k is the number of distinct operators. Space complexity: O(n).

Hints

A precedence-climbing parser is a good fit here.
For left associativity, recurse with precedence + 1; for right associativity, recurse with the same precedence.

Part 8: Choose the Best Parser Under Time and Memory Cost Models

You are given grammar statistics, a memory budget, and several program lengths to parse with the same grammar. Compute the estimated total cost of each candidate parser and choose the fastest feasible one. Use these models: LL1 -> build = n*l + n*m, parse each program = t, space = n*m; LR1 -> build = p*l + n*m, parse each program = t, space = p*m + n; Earley -> build = l, parse each program = t^3 if ambiguous else t^2, space = max_t^2; PEG -> build = l, parse each program = l*t, space = l + max_t. Total time is build plus the sum of per-program parse times. If multiple parsers tie on time, choose the one with smaller space; if still tied, choose lexicographically smaller parser name.

Constraints

Candidate names are drawn from LL1, LR1, Earley, and PEG
0 <= number of workloads <= 1000
All numeric inputs are non-negative integers

Examples

Input: {'candidates': ['LL1', 'LR1', 'Earley', 'PEG'], 'n': 3, 'p': 5, 'l': 12, 'm': 6, 'token_counts': [10, 20, 5], 'memory_budget': 100, 'ambiguous': False}

Expected Output: {'parser': 'LL1', 'time': 89, 'space': 18}

Explanation: LL1 has the lowest feasible total cost.

Input: {'candidates': ['Earley', 'PEG'], 'n': 2, 'p': 2, 'l': 3, 'm': 2, 'token_counts': [1], 'memory_budget': 10, 'ambiguous': False}

Expected Output: {'parser': 'Earley', 'time': 4, 'space': 1}

Explanation: For this tiny workload, Earley is cheaper than PEG.

Input: {'candidates': ['LL1', 'PEG'], 'n': 4, 'p': 4, 'l': 10, 'm': 5, 'token_counts': [20], 'memory_budget': 15, 'ambiguous': False}

Expected Output: {'parser': None, 'time': None, 'space': None}

Explanation: Edge case: no candidate fits within the memory budget.

Input: {'candidates': ['LL1', 'LR1'], 'n': 2, 'p': 3, 'l': 5, 'm': 4, 'token_counts': [], 'memory_budget': 20, 'ambiguous': False}

Expected Output: {'parser': 'LL1', 'time': 18, 'space': 8}

Explanation: With no programs to parse, only build cost matters.

Solution

def solution(data):
    candidates = data['candidates']
    n = data['n']
    p = data['p']
    l = data['l']
    m = data['m']
    token_counts = data['token_counts']
    memory_budget = data['memory_budget']
    ambiguous = data['ambiguous']
    max_t = max(token_counts) if token_counts else 0

    best = None
    for name in candidates:
        if name == 'LL1':
            total_time = n * l + n * m + sum(token_counts)
            space = n * m
        elif name == 'LR1':
            total_time = p * l + n * m + sum(token_counts)
            space = p * m + n
        elif name == 'Earley':
            parse_cost = 0
            for t in token_counts:
                parse_cost += t ** 3 if ambiguous else t ** 2
            total_time = l + parse_cost
            space = max_t ** 2
        elif name == 'PEG':
            total_time = l + sum(l * t for t in token_counts)
            space = l + max_t
        else:
            continue

        if space > memory_budget:
            continue

        candidate = (total_time, space, name)
        if best is None or candidate < best:
            best = candidate

    if best is None:
        return {'parser': None, 'time': None, 'space': None}
    return {'parser': best[2], 'time': best[0], 'space': best[1]}

Time complexity: O(C + W), where C is the number of candidates and W is the number of workloads. Space complexity: O(1).

Hints

Build cost is paid once, but parse cost is paid for every token count in token_counts.
For Earley and PEG, working memory depends on the largest input, not the sum.

Part 9: Stream and Parse Statements Incrementally from Input Chunks

Implement streaming parsing for the tiny assignment/expression language from Part 5. Input arrives as a list of text chunks. As soon as a complete top-level semicolon-terminated statement is available, parse it and emit its AST. A semicolon only ends a statement when it is not inside parentheses. Return all completed statement ASTs plus the unconsumed remainder.

Constraints

0 <= total input length <= 100000
Completed statements in tests are syntactically valid
The remainder may contain an incomplete trailing statement

Examples

Input: ['x = 1 +', ' 2;', 'y = x * 3;']

Expected Output: {'statements': [{'type': 'Assign', 'name': 'x', 'value': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 1}, 'right': {'type': 'Number', 'value': 2}}}, {'type': 'Assign', 'name': 'y', 'value': {'type': 'BinOp', 'op': '*', 'left': {'type': 'Identifier', 'name': 'x'}, 'right': {'type': 'Number', 'value': 3}}}], 'remainder': ''}

Explanation: The first statement is completed across chunk boundaries.

Input: ['(1+2', ')*3;', 'z=4']

Expected Output: {'statements': [{'type': 'ExprStmt', 'expr': {'type': 'BinOp', 'op': '*', 'left': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 1}, 'right': {'type': 'Number', 'value': 2}}, 'right': {'type': 'Number', 'value': 3}}}], 'remainder': 'z=4'}

Explanation: Only the first top-level semicolon-terminated statement is complete.

Input: []

Expected Output: {'statements': [], 'remainder': ''}

Explanation: Edge case: no input chunks.

Input: ['x=1;', 'y=2', '+3;']

Expected Output: {'statements': [{'type': 'Assign', 'name': 'x', 'value': {'type': 'Number', 'value': 1}}, {'type': 'Assign', 'name': 'y', 'value': {'type': 'BinOp', 'op': '+', 'left': {'type': 'Number', 'value': 2}, 'right': {'type': 'Number', 'value': 3}}}], 'remainder': ''}

Explanation: The second assignment is incomplete until the final chunk arrives.

Solution

def solution(chunks):
    def lex(text):
        tokens = []
        i = 0
        while i < len(text):
            c = text[i]
            if c.isspace():
                i += 1
            elif c.isdigit():
                j = i
                while j < len(text) and text[j].isdigit():
                    j += 1
                tokens.append(('NUMBER', int(text[i:j])))
                i = j
            elif c.isalpha() or c == '_':
                j = i
                while j < len(text) and (text[j].isalnum() or text[j] == '_'):
                    j += 1
                tokens.append(('IDENT', text[i:j]))
                i = j
            elif c in '+-*/=();':
                tokens.append((c, c))
                i += 1
            else:
                raise ValueError('invalid character')
        tokens.append(('EOF', 'EOF'))
        return tokens

    def parse_statement_text(text):
        tokens = lex(text)
        i = 0

        def peek(k=0):
            return tokens[i + k][0]

        def eat(expected=None):
            nonlocal i
            tok = tokens[i]
            if expected is not None and tok[0] != expected:
                raise ValueError('syntax error')
            i += 1
            return tok

        def parse_factor():
            if peek() == 'NUMBER':
                return {'type': 'Number', 'value': eat('NUMBER')[1]}
            if peek() == 'IDENT':
                return {'type': 'Identifier', 'name': eat('IDENT')[1]}
            if peek() == '(':
                eat('(')
                node = parse_expr()
                eat(')')
                return node
            raise ValueError('expected factor')

        def parse_term():
            node = parse_factor()
            while peek() in ('*', '/'):
                op = eat()[0]
                right = parse_factor()
                node = {'type': 'BinOp', 'op': op, 'left': node, 'right': right}
            return node

        def parse_expr():
            node = parse_term()
            while peek() in ('+', '-'):
                op = eat()[0]
                right = parse_term()
                node = {'type': 'BinOp', 'op': op, 'left': node, 'right': right}
            return node

        if peek() == 'IDENT' and peek(1) == '=':
            name = eat('IDENT')[1]
            eat('=')
            value = parse_expr()
            eat(';')
            eat('EOF')
            return {'type': 'Assign', 'name': name, 'value': value}
        expr = parse_expr()
        eat(';')
        eat('EOF')
        return {'type': 'ExprStmt', 'expr': expr}

    statements = []
    current = []
    depth = 0

    for chunk in chunks:
        for ch in chunk:
            current.append(ch)
            if ch == '(':
                depth += 1
            elif ch == ')':
                if depth > 0:
                    depth -= 1
            elif ch == ';' and depth == 0:
                text = ''.join(current).strip()
                if text:
                    statements.append(parse_statement_text(text))
                current = []

    return {'statements': statements, 'remainder': ''.join(current)}

Time complexity: O(n), where n is the total number of characters across all chunks. Space complexity: O(n).

Hints

Track parenthesis depth while reading the stream.
You only need to parse a statement once its terminating semicolon appears at depth 0.