PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This coding question tests string parsing, tokenization logic, and pattern-matching algorithm design in a practical content moderation context. It evaluates proficiency in string manipulation, set lookups, and Trie-based multi-pattern search — progressing from single-word filtering to efficient contiguous phrase detection at scale.

  • medium
  • Whatnot
  • Coding & Algorithms
  • Software Engineer

Content Safety Filter

Company: Whatnot

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Technical Screen

# Content Safety Filter You are building a content-safety component for a messaging platform. The system receives a stream of chat messages and must drop any message that contains banned content before it is shown to other users. Each message is a pair of `(username, text)`, where `text` is a single line of free-form chat. You are also given a list of banned terms. Your job is to return only the messages that are considered **safe**. ## Part 1 — Filter unsafe words Given a list of messages and a list of banned **words**, return the messages whose `text` contains **none** of the banned words. Tokenization and matching rules (these are the tricky part): - A message's `text` is split into tokens **on whitespace only** (one or more spaces). Splitting on whitespace is the *only* way tokens are formed. - **Punctuation is part of the token.** No punctuation stripping is performed. For example, in the text `i love Nintendo, do you`, the tokens are `i`, `love`, `Nintendo,`, `do`, `you`. The token `Nintendo,` (with the trailing comma) is **not** the same token as `Nintendo`. So if `nintendo` is a banned word, the text `i love Nintendo, do you` does **not** match it, but `i love Nintendo do you` does. - Matching is **case-insensitive**: a token matches a banned word if they are equal after lowercasing both. For example, the token `NINTENDO` matches the banned word `nintendo`. - A message is **unsafe** if **any** of its tokens equals **any** banned word (under the rules above); otherwise it is **safe**. Return the safe messages **in their original input order**. ### Input - `messages`: a list of `(username, text)` pairs, where `username` and `text` are strings. - `banned_words`: a list of strings (single words, no internal whitespace). ### Output - The sublist of `messages` (each still a `(username, text)` pair) that are safe, in the original order. ### Example ``` messages = [ ("alice", "i love Nintendo, do you"), ("bob", "i love Nintendo do you"), ("carol", "great game NINTENDO rocks"), ("dave", "nothing banned here"), ] banned_words = ["nintendo", "sega"] ``` Expected safe output: ``` [ ("alice", "i love Nintendo, do you"), # token is "Nintendo," (with comma) -> no exact match -> safe ("dave", "nothing banned here"), ] ``` `bob` is unsafe (token `Nintendo` matches `nintendo`), and `carol` is unsafe (token `NINTENDO` matches `nintendo` case-insensitively). ## Part 2 (Follow-up) — Filter unsafe phrases Now upgrade the filter so that banned content can be a **phrase**: a sequence of one or more words that must appear **consecutively** (in order, adjacent tokens) inside a message's token list to count as a match. Given the same messages and a list of banned **phrases** (each phrase is itself a sequence of words separated by single spaces), return the messages whose token sequence contains **none** of the banned phrases as a contiguous run of tokens. Rules: - Use the **same** tokenization as Part 1 (split on whitespace; punctuation stays attached to the token). - Matching of each phrase word against a message token is **case-insensitive**, token-by-token. - A phrase matches if its sequence of words equals, in order, a contiguous run of tokens in the message (e.g. the phrase `play nintendo now` matches a message whose tokens contain `... play nintendo now ...` consecutively, case-insensitively, with no gaps and no reordering). - Single-word phrases reduce to the Part 1 behavior. - A message is **unsafe** if **any** banned phrase matches; otherwise it is **safe**. Return safe messages in original order. ### Input - `messages`: a list of `(username, text)` pairs. - `banned_phrases`: a list of strings; each string is one phrase whose words are separated by single spaces. ### Output - The sublist of safe messages, in original order. ### Example ``` messages = [ ("alice", "lets play nintendo now ok"), ("bob", "lets play sega now ok"), ("carol", "play now nintendo later"), ] banned_phrases = ["play nintendo now"] ``` Expected safe output: ``` [ ("bob", "lets play sega now ok"), ("carol", "play now nintendo later"), # words present but not consecutive in order -> safe ] ``` Only `alice` is unsafe: her tokens contain `play`, `nintendo`, `now` consecutively and in order. ### Constraints - Let $M$ be the number of messages and $L$ the total number of tokens across all messages. - Let $B$ be the number of banned words/phrases and $W$ the total number of words across all banned phrases. - Design Part 2 so that the per-message scan is efficient even when there are many overlapping phrases (consider building all phrases into a single structure so one left-to-right pass over a message's tokens is enough, rather than re-checking every phrase independently at every position). - $1 \le M, L \le 10^5$; $0 \le B \le 10^4$; tokens and phrase words are non-empty ASCII strings.

Quick Answer: This coding question tests string parsing, tokenization logic, and pattern-matching algorithm design in a practical content moderation context. It evaluates proficiency in string manipulation, set lookups, and Trie-based multi-pattern search — progressing from single-word filtering to efficient contiguous phrase detection at scale.

Content Safety Filter — Part 1: Filter Unsafe Words

You are building a content-safety component for a messaging platform. Each message is a `(username, text)` pair where `text` is a single line of free-form chat. Given a list of messages and a list of banned **words**, return only the messages whose `text` contains **none** of the banned words, in original input order. **Tokenization & matching rules (the tricky part):** - A message's `text` is split into tokens **on whitespace only** (one or more spaces). Whitespace splitting is the *only* way tokens are formed. - **Punctuation is part of the token** — no stripping. In `i love Nintendo, do you` the tokens are `i`, `love`, `Nintendo,`, `do`, `you`. The token `Nintendo,` (with the comma) is **not** the same as `Nintendo`, so if `nintendo` is banned, `i love Nintendo, do you` does **not** match but `i love Nintendo do you` does. - Matching is **case-insensitive**: a token matches a banned word if they are equal after lowercasing both (`NINTENDO` matches `nintendo`). - A message is **unsafe** if **any** token equals **any** banned word; otherwise **safe**. **Input:** `messages` — a list of `(username, text)` string pairs. `banned_words` — a list of single-word strings (no internal whitespace). **Output:** the sublist of safe `(username, text)` pairs, in original order. **Example:** ``` messages = [ ("alice", "i love Nintendo, do you"), ("bob", "i love Nintendo do you"), ("carol", "great game NINTENDO rocks"), ("dave", "nothing banned here"), ] banned_words = ["nintendo", "sega"] # -> [("alice", "i love Nintendo, do you"), ("dave", "nothing banned here")] ``` bob is unsafe (`Nintendo` matches `nintendo`); carol is unsafe (`NINTENDO` matches case-insensitively); alice is safe because her token is `Nintendo,` with a trailing comma.

Constraints

  • 1 <= M (number of messages), L (total tokens across all messages) <= 10^5
  • 0 <= B (number of banned words) <= 10^4
  • Tokens and banned words are non-empty ASCII strings
  • Split on whitespace only; punctuation stays attached to the token
  • Matching is case-insensitive (compare after lowercasing)

Examples

Input: ([['alice', 'i love Nintendo, do you'], ['bob', 'i love Nintendo do you'], ['carol', 'great game NINTENDO rocks'], ['dave', 'nothing banned here']], ['nintendo', 'sega'])

Expected Output: [['alice', 'i love Nintendo, do you'], ['dave', 'nothing banned here']]

Explanation: bob's token `Nintendo` matches and carol's token `NINTENDO` matches case-insensitively, both unsafe. alice's token is `Nintendo,` (comma attached), which is a different token, so she is safe. dave has no banned tokens.

Input: ([], ['x'])

Expected Output: []

Explanation: No messages -> empty result.

Input: ([['u', 'hello world']], [])

Expected Output: [['u', 'hello world']]

Explanation: Empty banned list -> nothing is ever unsafe; every message survives.

Input: ([['u', 'SeGa rules'], ['v', 'sega']], ['SEGA'])

Expected Output: []

Explanation: Banned word `SEGA` lowercases to `sega`; tokens `SeGa` and `sega` both match case-insensitively, so both messages are unsafe.

Input: ([['u', 'word'], ['v', 'words'], ['w', 'word.']], ['word'])

Expected Output: [['v', 'words'], ['w', 'word.']]

Explanation: Only exact token equality counts: `word` matches and is removed, but `words` and `word.` are different tokens and stay safe (no substring matching).

Input: ([['u', ' spaced out here']], ['gap'])

Expected Output: [['u', ' spaced out here']]

Explanation: Leading and runs of internal whitespace collapse during tokenization to `spaced`, `out`, `here`; none match `gap`, so the (original, untouched) text is returned safe.

Hints

  1. Lowercase every banned word once and store them in a hash set for O(1) membership tests.
  2. Use plain whitespace splitting (Python's str.split() with no args, which also collapses runs of spaces). Do NOT strip punctuation — `Nintendo,` must stay distinct from `Nintendo`.
  3. A message is unsafe the moment one of its lowercased tokens is in the banned set; otherwise keep it. Preserve input order by appending as you scan.

Content Safety Filter — Part 2: Filter Unsafe Phrases

Building on Part 1, banned content can now be a **phrase**: a sequence of one or more words that must appear **consecutively** (in order, adjacent tokens, no gaps, no reordering) inside a message's token list to count as a match. Given the same messages and a list of banned **phrases** (each phrase's words separated by single spaces), return the messages whose token sequence contains **none** of the banned phrases as a contiguous run of tokens, in original order. **Rules:** - Use the **same** tokenization as Part 1 (split on whitespace; punctuation stays attached). - Each phrase word is matched against a message token **case-insensitively**, token-by-token. - A phrase matches when its words equal, in order, a contiguous run of tokens (e.g. phrase `play nintendo now` matches `... play nintendo now ...` consecutively). - Single-word phrases reduce to Part 1. - A message is **unsafe** if **any** banned phrase matches; otherwise **safe**. **Input:** `messages` — list of `(username, text)` pairs. `banned_phrases` — list of strings, each a phrase whose words are separated by single spaces. **Output:** the sublist of safe messages, in original order. **Example:** ``` messages = [ ("alice", "lets play nintendo now ok"), ("bob", "lets play sega now ok"), ("carol", "play now nintendo later"), ] banned_phrases = ["play nintendo now"] # -> [("bob", "lets play sega now ok"), ("carol", "play now nintendo later")] ``` Only alice is unsafe: her tokens contain `play`, `nintendo`, `now` consecutively in order. carol has all three words but not consecutively in order, so she is safe. **Efficiency note:** design the per-message scan to stay efficient with many overlapping phrases — group phrases (e.g. by length, or build a single Aho–Corasick / trie structure over tokens) so one left-to-right pass over a message handles all phrases, rather than re-checking every phrase independently at every position.

Constraints

  • 1 <= M (number of messages), L (total tokens across all messages) <= 10^5
  • 0 <= B (number of banned phrases) <= 10^4; W = total words across all phrases
  • Tokens and phrase words are non-empty ASCII strings
  • Same tokenization as Part 1: split on whitespace, punctuation stays attached
  • Phrase match must be contiguous, in order, and case-insensitive token-by-token; single-word phrases reduce to Part 1

Examples

Input: ([['alice', 'lets play nintendo now ok'], ['bob', 'lets play sega now ok'], ['carol', 'play now nintendo later']], ['play nintendo now'])

Expected Output: [['bob', 'lets play sega now ok'], ['carol', 'play now nintendo later']]

Explanation: alice has `play nintendo now` consecutively in order -> unsafe. bob has `play sega now` (sega breaks the phrase). carol has all three words but `play now nintendo` order/adjacency doesn't match. Both bob and carol are safe.

Input: ([['u', 'i love Nintendo, do you'], ['v', 'i love Nintendo do you']], ['nintendo'])

Expected Output: [['u', 'i love Nintendo, do you']]

Explanation: Single-word phrase reduces to Part 1: `v`'s token `Nintendo` matches `nintendo` case-insensitively (unsafe); `u`'s token is `Nintendo,` with comma, a different token, so safe.

Input: ([['u', 'PLAY Nintendo NOW everyone']], ['play nintendo now'])

Expected Output: []

Explanation: Case-insensitive token-by-token: `PLAY Nintendo NOW` lowercases to `play nintendo now`, matching the phrase consecutively -> unsafe.

Input: ([], ['x y'])

Expected Output: []

Explanation: No messages -> empty result regardless of phrases.

Input: ([['u', 'a b c'], ['v', 'a b']], [])

Expected Output: [['u', 'a b c'], ['v', 'a b']]

Explanation: Empty phrase list -> nothing can be unsafe; all messages survive.

Input: ([['u', 'go go go'], ['v', 'go go']], ['go go go'])

Expected Output: [['v', 'go go']]

Explanation: `u`'s tokens `go go go` contain the 3-word phrase consecutively -> unsafe. `v` has only two `go` tokens, too short to contain the 3-word phrase -> safe.

Input: ([['u', 'end of phrase test phrase here']], ['phrase here'])

Expected Output: []

Explanation: The phrase `phrase here` appears at the very end of the token list (`... phrase here`), matching consecutively -> unsafe.

Input: ([['u', 'red blue green'], ['v', 'blue green red']], ['blue green', 'red blue'])

Expected Output: []

Explanation: `u` contains `red blue` and also `blue green`; `v` contains `blue green`. Each message matches at least one banned phrase, so both are unsafe and the safe result is empty.

Hints

  1. Lowercase both the message tokens and each phrase's words so all comparisons are case-insensitive.
  2. Bucket phrases by their word-count k and store each as a tuple in a set per length. For a message of n tokens, slide a window of size k and check if the k-token tuple is in that length's set — this avoids re-matching every phrase from scratch at each position.
  3. For the most overlapping-phrase-efficient design, build a single trie / Aho–Corasick automaton over the phrase token sequences and run one left-to-right pass over each message's tokens, flagging unsafe on the first complete phrase hit.
  4. A message is safe only if NO banned phrase matches anywhere; short-circuit and stop scanning the moment one phrase matches.
Last updated: Jun 24, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Solve Adjacent-Deletion and Sorted-Square Problems - Whatnot (easy)
  • Count User Journey Prefixes - Whatnot (easy)
  • Build a Tic-Tac-Toe App - Whatnot (medium)
  • Build User Journey Path Summary - Whatnot (hard)