Detecting Malicious Duplicated Text (DOT)
Scenario
You are selecting technical approaches for DOT, a bot‑detection tool aimed at finding malicious duplicated content across posts/comments at large scale and in near real time.
Assume the system must:
- Detect exact and near-duplicate text (minor edits, punctuation, spacing, emojis, casing).
- Scale to billions of items with low latency.
- Handle multilingual content and adversarial obfuscations.
- Distinguish benign mass-copying (e.g., news headlines) from coordinated malicious campaigns.
Question
What models or algorithms could help identify malicious duplicated content, and why are they suitable?
Hints
- Text hashing (exact duplicates), fuzzy hashing
- TF‑IDF with cosine similarity and inverted indexes
- Embeddings with Siamese/bi‑encoder networks; transformer encoders (multilingual)
- MinHash/SimHash with LSH for near‑duplicates
- Clustering and similarity thresholds (e.g., DBSCAN/connected components)
- Supervised vs. unsupervised approaches; candidate generation vs. scoring
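
As one way to make the MinHash/LSH hint and the candidate-generation vs. scoring split concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not a reference design for DOT: the function names, the character 5-gram shingles, the 64-hash signature, the 16x4 banding, and the 0.7 Jaccard threshold are all hypothetical choices, and the normalization is deliberately crude (it drops every non-alphanumeric character, which would damage scripts that rely on combining marks).

```python
import hashlib
import unicodedata
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 64              # signature length (assumed value)
BANDS = 16                   # LSH bands (assumed value)
ROWS = NUM_HASHES // BANDS   # rows per band -> 4


def normalize(text):
    """Fold case and Unicode compatibility forms, drop punctuation,
    symbols, and emojis, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return " ".join(text.split())


def shingles(text, k=5):
    """Return the set of character k-grams of the normalized text."""
    norm = normalize(text)
    if len(norm) <= k:
        return {norm}
    return {norm[i:i + k] for i in range(len(norm) - k + 1)}


def _hash(seed, shingle):
    """Seeded 64-bit hash; each seed acts as one MinHash permutation."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(shingle_set):
    """MinHash signature: the minimum hash value per seed."""
    return tuple(min(_hash(seed, s) for s in shingle_set)
                 for seed in range(NUM_HASHES))


def candidate_pairs(signatures):
    """Candidate generation: items sharing any full band land in one bucket."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(BANDS):
            buckets[(b, sig[b * ROWS:(b + 1) * ROWS])].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b)


def near_duplicates(posts, threshold=0.7):
    """Scoring: keep only candidate pairs whose exact Jaccard passes the threshold."""
    sets = {pid: shingles(text) for pid, text in posts.items()}
    sigs = {pid: minhash(s) for pid, s in sets.items()}
    return sorted(p for p in candidate_pairs(sigs)
                  if jaccard(sets[p[0]], sets[p[1]]) >= threshold)


posts = {
    "a": "Win a FREE phone now!!! Click here",
    "b": "win a free phone now... click here",
    "c": "The weather is lovely in Lisbon today",
}
print(near_duplicates(posts))
```

Banding keeps the expensive pairwise comparison limited to items that collide in at least one bucket, which is what makes the approach plausible at large scale; the exact Jaccard check then removes chance collisions. Separating benign mass-copying from coordinated campaigns would need signals beyond text similarity, which this sketch does not attempt.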