Implement topological sort from string input
Company: Netflix
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
Quick Answer: This question evaluates a candidate's skills in parsing unstructured dependency strings into directed graphs, producing a valid topological order, detecting and reporting minimal cycles, and reasoning about robustness against duplicate, unknown, or malformed inputs as well as time and space complexity.
Part 1: Parse Dependency Strings into a Directed Graph
Constraints
- 0 <= len(lines) <= 10^4
- Total input length across all lines <= 2 * 10^5 characters
- Task names contain only letters, digits, and underscores
Examples
Input: ['install A after B', 'C->A', 'B before D']
Expected Output: {'A': [], 'B': ['A', 'D'], 'C': ['A'], 'D': []}
Explanation: Three different input formats all contribute edges to the same graph.
Input: [' Task1 -> Task2 ']
Expected Output: {'Task1': ['Task2'], 'Task2': []}
Explanation: Spaces around the arrow should be ignored.
Input: []
Expected Output: {}
Explanation: Edge case: no lines means no tasks and no edges.
Input: ['A before A']
Expected Output: {'A': ['A']}
Explanation: A self-dependency is still a valid directed edge at parse time.
Hints
- Normalize each line with strip() before parsing.
- The phrase 'install X after Y' reverses the edge direction: Y -> X.
Part 2: Return One Valid Topological Ordering
Constraints
- 0 <= number of tasks <= 2 * 10^5
- 0 <= number of edges <= 2 * 10^5
- Task names are strings
- Duplicate task names and duplicate edges may appear
Examples
Input: (['A', 'B', 'C', 'D'], [('B', 'A'), ('C', 'A'), ('B', 'D')])
Expected Output: ['B', 'C', 'A', 'D']
Explanation: Both B and C are initially available, so pick B first because it is lexicographically smaller.
Input: (['X'], [])
Expected Output: ['X']
Explanation: Edge case: a single isolated task is already a valid topological order.
Input: ([], [])
Expected Output: []
Explanation: Edge case: the empty graph has an empty topological ordering.
Input: (['A', 'B'], [('A', 'B'), ('B', 'A')])
Expected Output: None
Explanation: A cycle makes topological ordering impossible.
Input: (['A', 'B', 'C'], [('A', 'C'), ('A', 'C'), ('B', 'C')])
Expected Output: ['A', 'B', 'C']
Explanation: Duplicate edges should not change the result.
Hints
- Kahn's algorithm is a natural fit here.
- Use a min-heap instead of a queue if you want the lexicographically smallest valid order.
Part 3: Detect and Report a Minimal Directed Cycle
Constraints
- 1 <= number of tasks <= 200 for the intended shortest-cycle approach
- 0 <= number of edges <= 5000
- Duplicate edges may appear
Examples
Input: (['A', 'B', 'C', 'D'], [('A', 'B'), ('B', 'C'), ('C', 'A'), ('C', 'D')])
Expected Output: ['A', 'B', 'C', 'A']
Explanation: The only cycle is A -> B -> C -> A.
Input: (['A', 'B', 'C'], [('A', 'A'), ('A', 'B'), ('B', 'C'), ('C', 'A')])
Expected Output: ['A', 'A']
Explanation: A self-loop is a cycle of length 1, which is minimal.
Input: (['A', 'B', 'C'], [('A', 'B'), ('B', 'C')])
Expected Output: []
Explanation: This graph is acyclic.
Input: (['B', 'A', 'D', 'C'], [('A', 'B'), ('B', 'A'), ('C', 'D'), ('D', 'C')])
Expected Output: ['A', 'B', 'A']
Explanation: There are two shortest 2-cycles, but ['A', 'B', 'A'] is lexicographically smaller than ['C', 'D', 'C'].
Hints
- A shortest directed cycle through a start node can be found by looking for the shortest path from that start node to one of its predecessors.
- BFS is useful because all edges have equal weight.
Part 4: Robust Dependency Parsing with Duplicate, Unknown, and Malformed Input Handling
Constraints
- 0 <= len(known_tasks) <= 10^4
- 0 <= len(lines) <= 10^4
- Total input length <= 2 * 10^5 characters
Examples
Input: (['A', 'B', 'C', 'D'], ['install A after B', 'B before D', 'B before D', 'X->A', 'bad line', 'C->A'])
Expected Output: {'graph': {'A': [], 'B': ['A', 'D'], 'C': ['A'], 'D': []}, 'ignored_duplicates': 1, 'unknown_tasks': ['X'], 'malformed_lines': [4]}
Explanation: One duplicate edge is ignored, one unknown task is reported, and one malformed line is recorded.
Input: (['A', 'B'], [])
Expected Output: {'graph': {'A': [], 'B': []}, 'ignored_duplicates': 0, 'unknown_tasks': [], 'malformed_lines': []}
Explanation: Edge case: no lines means no errors and an empty graph over the known tasks.
Input: (['A', 'B'], ['A before', 'Y before Z', 'install A after B'])
Expected Output: {'graph': {'A': [], 'B': ['A']}, 'ignored_duplicates': 0, 'unknown_tasks': ['Y', 'Z'], 'malformed_lines': [0]}
Explanation: The first line is malformed, the second line references only unknown tasks, and the third line is valid.
Input: (['A'], ['', 'A->A', 'A->A'])
Expected Output: {'graph': {'A': ['A']}, 'ignored_duplicates': 1, 'unknown_tasks': [], 'malformed_lines': []}
Explanation: Blank lines are ignored and duplicate self-edges are counted once.
Hints
- Separate parsing from validation. First detect whether a line matches one of the supported formats, then check whether both tasks are known.
- A set is useful both for deduplicating edges and for collecting unique unknown task names.
Part 5: Compute Time and Space Estimates for Dependency Processing
Constraints
- 0 <= L, N, M <= 10^9
- lexicographic is a boolean
Examples
Input: (25, 4, 3, False)
Expected Output: (35, 11, 'O(L + N + M)', 'O(N + M)')
Explanation: Time = 25 parse + 3 build + 4 queue removals + 3 relaxations = 35.
Input: (25, 4, 3, True)
Expected Output: (55, 11, 'O(L + M + N log N)', 'O(N + M)')
Explanation: ceil(log2(5)) = 3, so heap work is 2 * 4 * 3 = 24.
Input: (0, 0, 0, False)
Expected Output: (0, 0, 'O(L + N + M)', 'O(N + M)')
Explanation: Edge case: an empty dataset has zero exact work and zero exact storage.
Input: (10, 1, 0, True)
Expected Output: (12, 2, 'O(L + M + N log N)', 'O(N + M)')
Explanation: ceil(log2(2)) = 1, so heap work is 2 operations beyond parsing.
Hints
- Be careful to include both graph-building work and topological-processing work.
- For heap mode, use ceil(log2(N + 1)); when N = 0, the heap cost is 0.
Part 6: Batched Topological Sort for Large Graphs
Constraints
- 0 <= number of tasks <= 2 * 10^5
- 0 <= number of edges <= 2 * 10^5
- 1 <= batch_size
- Duplicate edges may appear
Examples
Input: (['A', 'B', 'C', 'D', 'E'], [('A', 'D'), ('B', 'D'), ('C', 'E')], 2)
Expected Output: [['A', 'B'], ['C', 'D'], ['E']]
Explanation: The first batch takes the two smallest ready tasks, A and B.
Input: (['A', 'B', 'C', 'D', 'E'], [('A', 'D'), ('B', 'D'), ('C', 'E')], 1)
Expected Output: [['A'], ['B'], ['C'], ['D'], ['E']]
Explanation: With batch size 1, this becomes a deterministic one-at-a-time topological sort.
Input: ([], [], 3)
Expected Output: []
Explanation: Edge case: an empty graph produces no batches.
Input: (['A', 'B'], [('A', 'B'), ('B', 'A')], 2)
Expected Output: None
Explanation: A cycle prevents all tasks from ever becoming ready.
Hints
- This is still Kahn's algorithm; the main change is that you pop several ready nodes at once.
- Use a min-heap so each batch is lexicographically deterministic.