Suppose you are given DNA sequences represented as strings over {A, C, G, T}, and sequence lengths may vary. You have:
- a small labeled dataset containing both real DNA sequences and fake DNA sequences,
- a much larger dataset containing only confirmed real DNA sequences.
Design a machine learning system that predicts whether a new DNA sequence is real or fake.
Discuss:
- how you would represent the sequences (for example, k-mer features, handcrafted biological features, embeddings, or sequence models),
- whether you would frame this as standard supervised classification, anomaly detection / one-class learning, or a hybrid approach,
- how you would create or augment fake examples without making them unrealistically easy to distinguish,
- what model(s) you would try first and why,
- how you would evaluate performance under class imbalance and possible distribution shift,
- how you would prevent train/test leakage from duplicated or highly similar sequences,
- how you would calibrate the final decision threshold if falsely classifying a real sequence as fake is costly.
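
As a starting point for the representation and threshold items above, here is a minimal sketch in plain Python. It is one possible baseline, not a prescribed solution. The function names `kmer_frequencies` and `threshold_for_real_fpr`, the default `k=3`, and the convention that a higher score means "more likely fake" are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(seq, k=3):
    """Map a variable-length sequence to a fixed-length vector of
    normalized k-mer frequencies over the 4**k k-mers of ACGT."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocab = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # k-mers containing other symbols (e.g. N) are ignored.
    total = sum(counts[kmer] for kmer in vocab)
    return [counts[kmer] / total if total else 0.0 for kmer in vocab]


def threshold_for_real_fpr(real_scores, max_fpr=0.01):
    """Choose a threshold t from held-out real-sequence scores so that
    at most max_fpr of them satisfy score > t (assumed convention:
    higher score = more likely fake)."""
    if not real_scores:
        raise ValueError("need at least one held-out real score")
    ordered = sorted(real_scores)
    n = len(ordered)
    # Number of real sequences we tolerate flagging as fake.
    allowed = min(int(max_fpr * n), n - 1)
    return ordered[n - allowed - 1]


# Hypothetical usage with made-up scores from some upstream model.
features = kmer_frequencies("ACGTACGTTAGC", k=2)
t = threshold_for_real_fpr([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
                           max_fpr=0.25)
```

Because the large real-only dataset gives many held-out real scores, a threshold set this way bounds the rate of real sequences flagged as fake directly, without depending on the small labeled set. The held-out real sequences used here should be deduplicated against the training data, or the estimated rate will be optimistic.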