You are building a model to predict whether a user will successfully file taxes (binary label success) for a TurboTax-like product.
One of the most predictive features is:
session_count = cumulative number of sessions a user has had in the product.
However:
- session_count has many values that are 0 and many that are missing.
- session_count is not available at scoring time (i.e., when you need to make the prediction), even though it appears in the schema.
- session_count is negatively correlated with success.
1. Why might session_count often be 0 or missing, and how would you treat these cases during modeling?
2. If session_count is not available at inference time, what are your options? How do you decide whether to (a) drop it, (b) engineer a proxy, or (c) change the prediction timing / problem definition?
3. Propose several explanations for the negative correlation between session_count and success (including an “opposite viewpoint”), and describe what additional data or analyses you would use to validate/refute each explanation.
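For the question about 0 versus missing values, one possible starting point is sketched below. It assumes a pandas DataFrame with the columns session_count and success; the toy rows and the helper name add_missingness_features are invented for illustration and do not come from the prompt. The idea is to keep "missing" and "0" as separate signals instead of collapsing them, then compare the success rate in each group. Because session_count is not available at scoring time, this is only useful for offline analysis or for a reformulated problem in which the feature can be observed.

```python
import numpy as np
import pandas as pd

# Toy data, invented purely for illustration (not from the prompt).
df = pd.DataFrame({
    "session_count": [0, np.nan, 3, 12, np.nan, 0, 7, 1],
    "success":       [0, 0, 1, 0, 1, 0, 0, 1],
})


def add_missingness_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: encode 'missing' and 'zero' as separate flags."""
    out = frame.copy()
    # 1 where the value was never recorded (NaN), else 0.
    out["session_count_missing"] = out["session_count"].isna().astype(int)
    # 1 where the value is an observed 0; NaN == 0 is False, so missing rows get 0.
    out["session_count_zero"] = (out["session_count"] == 0).astype(int)
    # Filled copy for models that cannot handle NaN; the flags above keep
    # the distinction between a true 0 and an imputed 0.
    out["session_count_filled"] = out["session_count"].fillna(0)
    return out


features = add_missingness_features(df)

# Label each row as missing / zero / positive and compare success rates.
features["group"] = "positive"
features.loc[features["session_count_zero"] == 1, "group"] = "zero"
features.loc[features["session_count_missing"] == 1, "group"] = "missing"

success_by_group = features.groupby("group")["success"].mean()
print(success_by_group)
```

If the missing group and the zero group show clearly different success rates on real data, that would suggest the two cases arise from different mechanisms (for example, a logging gap versus a user who truly never opened the product) and should not be imputed to the same value.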