Median vs. Mean Under L1 and L2 Loss, and the 2D Extension
Explain, with intuition and a brief derivation, the relationship between the choice of loss function and the optimal point estimate.
Specifically, address all of the following:
-
1D.
Why does the
median
minimize the sum of absolute deviations (
L1
), while the
mean
minimizes the sum of squared deviations (
L2
)?
-
2D.
When you minimize total
Manhattan
distance to a set of 2D points versus total
Euclidean
distance, which estimator do you get in each case? (Be explicit about whether "Euclidean" means
squared
or
unsquared
distance — they differ.)
-
Robustness.
When is the average (mean) a poor choice, and what makes the median robust where the mean is not?
You should give the key optimality conditions / derivations (not just the answers), the geometric intuition for why each estimator falls out, and a short note on how you'd actually compute the median — including the optimal data structure and time/space complexity for the static and streaming cases.
Constraints & Assumptions
-
1D data:
x1,x2,…,xn∈R
.
-
2D points:
pi=(xi,yi)∈R2
.
-
"
L1
" = minimize the sum of absolute deviations; "
L2
" = minimize the sum of squared deviations.
-
For 2D Euclidean distance, treat both the
squared
and
unsquared
forms — they yield different optima.
-
Assume all points are weighted equally unless you choose to discuss the weighted generalization.
Clarifying Questions to Ask
-
Is "Euclidean L2" the
squared
distance
∥pi−c∥22
or the
unsquared
distance
∥pi−c∥2
? The answer changes the 2D optimum.
-
Is a
closed-form / exact
answer required, or is an iterative algorithm acceptable for the case that has no closed form?
-
For the computational part, is the data
static
(one-shot computation) or arriving as a
stream
(running median)? Bounded-memory / approximate answers OK?
-
Should the estimate be a single point, or is the full
set
of minimizers expected when it is not unique (e.g. even
n
)?
What a Strong Answer Covers
-
Correct optimality conditions, derived.
J′(μ)=0⇒∑i(xi−μ)=0
for the mean; the subgradient/sign-count condition for the median; states both objectives are convex so the stationary point is global.
-
Intuition for both estimators.
Mean = balance of
linear restoring forces
(center of mass); median = balance of
counts
on each side, independent of magnitudes.
-
The 2D separability argument.
Recognizes that Manhattan and squared-Euclidean separate by coordinate (→ coordinate-wise median and centroid respectively), while unsquared Euclidean does
not
separate (→ geometric median, no closed form, solved by Weiszfeld's algorithm).
-
Robustness framed quantitatively.
Mean breakdown point
0%
vs. median
50%
; ties this to outliers, skew, and (optionally) multimodality; names robust alternatives.
-
Non-uniqueness handled.
Even-
n
1D median is an interval; the 2D Manhattan optimum is the product of the two coordinate intervals.
-
Computation.
Mean in
O(n)
time /
O(1)
space; median via Quickselect (
O(n)
average) or sort (
O(nlogn)
); streaming median via two heaps (
O(logn)
insert,
O(1)
query); correct complexity claims.
Follow-up Questions
-
Maintain a
sliding-window
median of the last
k
elements of a stream. What data structure gives
O(logk)
per slide, and what's the trick for deletion from a heap?
-
The mean is the maximum-likelihood estimate under Gaussian noise. Under what noise model is the
median
the MLE, and how does that connect to
L1
loss?
-
Generalize to
weighted
data: what do the 1D minimizers become, and does the 2D Manhattan answer stay coordinate-wise separable?
-
For the geometric median, Weiszfeld's iteration divides by
∥pi−c(t)∥
. What goes wrong when an iterate lands exactly on a data point, and how is it patched?