Stream Processor: Query Registration and Log Tagging
Context
You are designing a streaming component that ingests a single mixed stream of messages. Each message is either:
-
A query registration (prefix "Q:"), which defines a string pattern to search for in future log lines.
-
A log line (prefix "L:"), which must be tagged with the IDs of all queries whose pattern appears in the log text.
Assume a query is a case-sensitive substring pattern (extendable to regex later). Query IDs are assigned incrementally starting at 1, in the order queries arrive. The system outputs an acknowledgment when a query is registered, and for each log it emits the list of matching query IDs (sorted ascending).
Example
Input stream:
-
Q: error
-
Q: timeout
-
L: database timeout after 5s
-
L: ERROR: connection reset
-
Q: reset
-
L: timeout reset error
Expected outputs:
-
For 1) → ACK 1
-
For 2) → ACK 2
-
For 3) → L: database timeout after 5s | matches: [2]
-
For 4) → L: ERROR: connection reset | matches: [3] (note: case-sensitive, so "ERROR" doesn't match "error")
-
For 5) → ACK 3
-
For 6) → L: timeout reset error | matches: [1, 2, 3]
Tasks
-
Design and implement a function/process that:
-
Assigns incremental IDs to queries as they arrive and outputs an acknowledgment (e.g., "ACK
<id>
").
-
For each log line, emits the log plus the list of matching query IDs, using the matching semantics defined above.
-
How would you modify your design to efficiently handle a very large volume of data (both queries and logs)?
-
How would you support deletion of queries in your current implementation, and what inefficiencies need to be addressed?
Assumptions
-
Matching is case-sensitive substring search; you may note how to extend to case-insensitive or regex.
-
Queries apply to logs arriving after their registration (no retroactive tagging).
-
In-order processing of the single input stream is sufficient for correctness (you may discuss scaling beyond one process).