How do I approach Machine Learning interview questions?

Machine Learning questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master machine learning interviews.

What difficulty level is this interview question?

This is a medium difficulty Machine Learning question, commonly asked during Technical Screen rounds at Jane Street.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Jane Street during technical interviews.

Build real-vs-fake DNA classifier | Jane Street Interview Question

Quick Overview

This question evaluates competency in machine learning for biological sequence data, covering sequence representation, model framing (supervised vs anomaly detection), data augmentation, model choice, evaluation under class imbalance and distribution shift, and mechanisms to prevent train/test leakage.

Suppose you are given DNA sequences represented as strings over {A, C, G, T}, and sequence lengths may vary. You have:

a small labeled dataset containing both real DNA sequences and fake DNA sequences,
a much larger dataset containing only confirmed real DNA sequences.

Design a machine learning system that predicts whether a new DNA sequence is real or fake.
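
As one possible starting point for the representation question, the sketch below maps a variable-length string over {A, C, G, T} to a fixed-length vector of normalized k-mer frequencies, so that sequences of different lengths can feed the same model. The function name `kmer_frequencies`, the default `k=3`, and the choice to skip windows containing other characters are illustrative assumptions, not part of the original question.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(seq: str, k: int = 3) -> list[float]:
    """Map a DNA string of any length to a vector of 4**k k-mer frequencies.

    Hypothetical helper for illustration; k=3 is an arbitrary default.
    """
    # Enumerate all k-mers in a fixed (lexicographic) order: AAA, AAC, ...
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}

    counts = [0] * len(kmers)
    seq = seq.upper()
    # Slide a window of width k over the sequence.
    for start in range(len(seq) - k + 1):
        position = index.get(seq[start:start + k])
        if position is not None:  # assumption: ignore windows with non-ACGT characters
            counts[position] += 1

    total = sum(counts)
    # Normalize by the number of counted windows so length does not dominate.
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]


if __name__ == "__main__":
    vector = kmer_frequencies("ACGTTGCA", k=2)
    print(len(vector), sum(vector))
```

Normalizing by the window count keeps long and short sequences comparable, at the cost of discarding length itself; if length turns out to be informative, it could be added as a separate feature.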

Discuss:

how you would represent the sequences (for example, k-mer features, handcrafted biological features, embeddings, or sequence models),
whether you would frame this as standard supervised classification, anomaly detection / one-class learning, or a hybrid approach,
how you would create or augment fake examples without making them unrealistically easy to distinguish,
what model(s) you would try first and why,
how you would evaluate performance under class imbalance and possible distribution shift,
how you would prevent train/test leakage from duplicated or highly similar sequences,
how you would calibrate the final decision threshold if falsely classifying a real sequence as fake is costly.
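
For the last point, one simple option is to set the threshold from model scores on a held-out slice of the large real-only dataset, so that the fraction of real sequences flagged as fake stays under a chosen budget. The sketch below assumes that a higher score means "more likely fake" and that the held-out reals were not used in training; the function name `threshold_for_real_budget` and the 1% default are hypothetical choices for illustration.

```python
import math


def threshold_for_real_budget(real_scores, max_false_flag_rate=0.01):
    """Pick a threshold t so that at most the given fraction of held-out
    real sequences have score > t (and would be flagged as fake).

    Assumption: higher score means "more likely fake".
    """
    if not real_scores:
        raise ValueError("need at least one held-out real score")
    if not 0.0 <= max_false_flag_rate <= 1.0:
        raise ValueError("max_false_flag_rate must be in [0, 1]")

    ordered = sorted(real_scores)
    n = len(ordered)
    # Number of held-out reals we are allowed to flag by mistake.
    allowed = math.floor(max_false_flag_rate * n)
    # Everything strictly above this element is flagged; ties at the
    # threshold are treated as real, which errs on the cautious side.
    cut = max(n - allowed - 1, 0)
    return ordered[cut]


def is_flagged_fake(score, threshold):
    """Flag a sequence as fake only when its score exceeds the threshold."""
    return score > threshold


if __name__ == "__main__":
    held_out_real_scores = [5, 1, 9, 3, 7, 2, 10, 4, 8, 6]
    t = threshold_for_real_budget(held_out_real_scores, 0.2)
    flagged = sum(is_flagged_fake(s, t) for s in held_out_real_scores)
    print(t, flagged)
```

A threshold chosen this way only controls the false-flag rate on data that resembles the held-out reals, so it would need to be rechecked if the incoming distribution shifts.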
