You are given a text classification dataset for spam detection (binary labels: spam vs not_spam) in a Jupyter notebook environment.
Task
- Preprocess the text (basic cleaning/tokenization is sufficient).
- Convert the text to features suitable for Naive Bayes (e.g., bag-of-words or TF-IDF).
- Train a Naive Bayes classifier.
- Evaluate the model using the F1 score (clearly state whether it is the F1 for the positive class or a specific averaging scheme).
- Run the trained model on a few test examples and show the predicted labels (and optionally probabilities).
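The steps above could be sketched roughly as follows, assuming scikit-learn is available; the toy texts, labels, and split ratio here are illustrative placeholders, not part of the actual dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

# Placeholder data; replace with the real dataset.
texts = [
    "win a free prize now", "limited offer click here",
    "meeting at 10am tomorrow", "lunch with the team today",
    "free cash claim now", "project update attached",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = spam (positive class), 0 = not_spam

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels
)

# Fit the vectorizer on training data only (no leakage into the test split).
vec = TfidfVectorizer(lowercase=True)
Xtr = vec.fit_transform(X_train)
Xte = vec.transform(X_test)

clf = MultinomialNB()
clf.fit(Xtr, y_train)

preds = clf.predict(Xte)
probs = clf.predict_proba(Xte)  # optional class probabilities

# F1 reported here is for the positive class (spam, label 1).
print("F1 (positive class = spam):", f1_score(y_test, preds))
print("Predictions:", list(zip(X_test, preds)))
```

`f1_score` defaults to `average="binary"` with `pos_label=1`, i.e., the F1 of the spam class; pass `average="macro"` or `"weighted"` for an averaging scheme instead.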
Constraints / Notes
- The dataset may be class-imbalanced.
- Avoid data leakage: fit the text vectorizer on training data only.
- If only one labeled set is provided, you may choose a reasonable train/validation split.
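Since the dataset may be imbalanced, a stratified split keeps the spam/not_spam ratio the same in both halves. A minimal sketch, assuming scikit-learn; the data below is a placeholder:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Placeholder imbalanced data: 2 spam vs 8 not_spam.
texts = ["free prize now"] * 2 + ["see you at the meeting"] * 8
labels = [1] * 2 + [0] * 8

# stratify=labels preserves the class ratio in both splits,
# so the rare spam class appears in train and validation alike.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.3, random_state=0, stratify=labels
)
print("train:", Counter(y_train), "val:", Counter(y_val))
```

Without `stratify`, a random split of a small imbalanced set can leave one side with no spam examples at all, making the F1 score undefined or misleading.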