You are given DNA sequences over the alphabet {A, C, G, T}. A small labeled dataset contains both real and fake DNA sequences. In addition, you have a much larger dataset containing only real DNA sequences.
Design a machine learning approach to classify whether a DNA sequence is real or fake.
Address the following:
- How would you represent DNA sequences for modeling (for example, k-mer features, learned embeddings, CNN/RNN/Transformer-based sequence models, or biologically motivated features)?
- How would you leverage the large real-only dataset? Would you frame this as standard supervised learning, positive-unlabeled learning, anomaly detection / one-class classification, self-supervised pretraining, or a hybrid approach?
- Would you generate additional fake sequences to augment training? If yes, how would you create synthetic negatives that are realistic and not trivially easy for the model to distinguish?
- How would you train and evaluate the model given limited labeled data, likely class imbalance, and the risk that synthetic fake sequences may not match real-world fake sequences?
- What important failure modes or domain-specific issues would you check for, such as reverse-complement symmetry, variable sequence length, GC-content bias, duplicated or near-duplicated sequences, and distribution shift between train and test data?
Your answer should propose a practical modeling strategy, explain trade-offs, and describe how you would validate that the classifier is learning biological structure rather than artifacts.
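
As one possible starting point for the representation and synthetic-negative questions above, the following is a minimal sketch, not a recommended final design. It counts k-mers in a reverse-complement-invariant way and builds a composition-preserving shuffled negative. All function names (`reverse_complement`, `canonical_kmer_counts`, `gc_content`, `shuffled_negative`) are hypothetical and chosen for illustration, and the default `k = 3` is an arbitrary assumption.

```python
import random
from collections import Counter

# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a sequence over {A, C, G, T}."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count k-mers, merging each k-mer with its reverse complement.

    Each k-mer is replaced by the lexicographically smaller of itself and
    its reverse complement, so a sequence and its reverse complement give
    identical counts. Windows with characters outside ACGT are skipped.
    """
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[min(kmer, reverse_complement(kmer))] += 1
    return counts


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def shuffled_negative(seq: str, rng: random.Random) -> str:
    """Baseline synthetic negative: shuffle the bases of a real sequence.

    Length and base composition (hence GC content) are preserved exactly,
    while positional structure is destroyed.
    """
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


if __name__ == "__main__":
    real = "ACGTTGCAAGGCTA"
    fake = shuffled_negative(real, random.Random(0))
    print(canonical_kmer_counts(real))
    print(gc_content(real), gc_content(fake))
```

Because the shuffle keeps length and GC content fixed, a classifier cannot separate these negatives from their source sequences using those two artifacts alone. It does destroy all higher-order structure, so such negatives are probably still easy to detect and are best treated as a sanity-check baseline rather than as realistic fakes.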