Design error detection and propagation algorithms
Company: Snowflake
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Quick Answer: This question evaluates algorithm design and complexity-analysis skills across search and graph problems, specifically log search for first-error detection, reachability and failure propagation in directed service dependency graphs, and longest-path reasoning under cycle or DAG assumptions.
Part 1: First Error in Monotonic Logs
Constraints
- 0 <= len(logs) <= 10^6
- Each log line starts with exactly one of: '[Info]', '[Warn]', '[Error]'
- The array is in chronological order
- If logs[i] starts with '[Error]', then all logs[j] for j > i also start with '[Error]'
- If any error exists, its index is at least 1 and logs[first_error_index - 1] starts with '[Warn]'
Examples
Input: ["[Info] boot", "[Warn] retrying", "[Error] disk failed", "[Error] shutting down"]
Expected Output: 2
Explanation: The first error appears at index 2.
Input: ["[Info] start", "[Info] running", "[Warn] slow response"]
Expected Output: -1
Explanation: There is no '[Error]' line, so return -1.
Input: []
Expected Output: -1
Explanation: Empty input contains no errors.
Input: ["[Warn] almost full", "[Error] out of memory"]
Expected Output: 1
Explanation: The smallest valid case with an error has the first error at index 1.
Solution
def solution(logs):
left = 0
right = len(logs) - 1
answer = -1
while left <= right:
mid = (left + right) // 2
if logs[mid].startswith('[Error]'):
answer = mid
right = mid - 1
else:
left = mid + 1
return answerTime complexity: O(log n). Space complexity: O(1).
Hints
- Turn each line into a boolean question: 'Is this line an error?' What special pattern does that boolean array have?
- You are looking for the first position where a monotonic condition becomes true.
Part 2: Failure Fan-Out in a Service Graph
Constraints
- 0 <= number of services <= 2 * 10^5
- 0 <= number of edges <= 4 * 10^5
- Service IDs are strings
- The graph may be disconnected
- The graph may contain cycles
Examples
Input: ({'A': ['B', 'C'], 'B': ['D'], 'E': ['B']}, 'D')
Expected Output: {'A', 'B', 'D', 'E'}
Explanation: D fails first, so B fails. Then both A and E, which depend on B, also fail.
Input: ({'Auth': ['DB'], 'API': ['Auth'], 'Worker': ['Queue']}, 'DB')
Expected Output: {'API', 'Auth', 'DB'}
Explanation: DB failure propagates to Auth and then to API. Worker is unaffected.
Input: ({'A': ['B'], 'B': ['C'], 'C': ['A']}, 'C')
Expected Output: {'A', 'B', 'C'}
Explanation: The graph contains a cycle, but visited tracking prevents infinite looping.
Input: ({}, 'X')
Expected Output: {'X'}
Explanation: If the graph is empty, only the initially failed service is known to fail.
Solution
def solution(graph, failed_service):
from collections import defaultdict, deque
reverse = defaultdict(list)
for caller, callees in graph.items():
for callee in callees:
reverse[callee].append(caller)
failed = {failed_service}
queue = deque([failed_service])
while queue:
service = queue.popleft()
for caller in reverse.get(service, []):
if caller not in failed:
failed.add(caller)
queue.append(caller)
return failedTime complexity: O(V + E). Space complexity: O(V + E).
Hints
- Failure travels in the opposite direction of the given edges.
- If the input gives caller -> callee, what graph do you need to traverse to move from a failed service to all affected callers?
Part 3: Longest Failure Propagation Chain in a DAG
Constraints
- 0 <= number of services <= 2 * 10^5
- 0 <= number of edges <= 4 * 10^5
- Service IDs are strings
- The dependency graph is a DAG
- If failed_service has no upstream callers, the answer is [failed_service]
Examples
Input: ({'A': ['B'], 'B': ['C'], 'D': ['C'], 'E': ['D'], 'F': ['E']}, 'C')
Expected Output: ['C', 'D', 'E', 'F']
Explanation: From C, one chain is C -> B -> A, but the longer one is C -> D -> E -> F.
Input: ({'API': ['Auth', 'Cache'], 'Web': ['API'], 'Mobile': ['API'], 'Frontend': ['Web'], 'Auth': ['DB']}, 'DB')
Expected Output: ['DB', 'Auth', 'API', 'Web', 'Frontend']
Explanation: The longest upstream failure chain goes from DB to Auth to API to Web to Frontend.
Input: ({'A': ['B']}, 'A')
Expected Output: ['A']
Explanation: A has no upstream callers in the propagation direction, so the chain contains only A.
Input: ({}, 'X')
Expected Output: ['X']
Explanation: With no graph edges, the longest chain is just the starting service.
Solution
def solution(graph, failed_service):
from collections import defaultdict, deque
reverse = defaultdict(list)
for caller, callees in graph.items():
for callee in callees:
reverse[callee].append(caller)
reachable = {failed_service}
stack = [failed_service]
while stack:
service = stack.pop()
for caller in reverse.get(service, []):
if caller not in reachable:
reachable.add(caller)
stack.append(caller)
indegree = {service: 0 for service in reachable}
for service in reachable:
for caller in reverse.get(service, []):
if caller in reachable:
indegree[caller] += 1
dist = {service: float('-inf') for service in reachable}
parent = {failed_service: None}
dist[failed_service] = 0
end = failed_service
queue = deque([failed_service])
while queue:
service = queue.popleft()
for caller in reverse.get(service, []):
if caller in reachable and dist[service] + 1 > dist[caller]:
dist[caller] = dist[service] + 1
parent[caller] = service
if dist[caller] > dist[end]:
end = caller
if caller in reachable:
indegree[caller] -= 1
if indegree[caller] == 0:
queue.append(caller)
path = []
current = end
while current is not None:
path.append(current)
current = parent.get(current)
return path[::-1]Time complexity: O(V + E). Space complexity: O(V + E).
Hints
- Reverse the edges so they point in the same direction as failure propagation.
- In a DAG, longest paths can be computed with dynamic programming over a topological order.