In the context of Transformer-style models, analyze the computational complexity of self-attention.
Assume a sequence length of \(n\) and hidden dimension \(d\).
- Derive the time and space complexity of standard scaled dot-product self-attention (a reference sketch follows this list).
- Explain why this becomes a bottleneck for long sequences.
- Describe at least three classes of methods that reduce the complexity (e.g., sparse attention, low-rank or kernel-based approximations, chunking/segmenting), including their high-level ideas and trade-offs (a chunked-attention sketch also follows below).
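As a reference point for the first two items, here is a minimal sketch of standard scaled dot-product self-attention, assuming NumPy and the \(n\), \(d\) notation above; the function name and the random inputs are illustrative, not tied to any particular library API.

```python
# Minimal NumPy sketch of standard scaled dot-product self-attention.
# Shapes follow the question's notation: n = sequence length, d = hidden dim.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (n, d). Returns an (n, d) output."""
    n, d = Q.shape
    # Score matrix: (n, d) @ (d, n) -> (n, n). This matmul costs O(n^2 * d)
    # time, and materializing the scores costs O(n^2) memory.
    scores = Q @ K.T / np.sqrt(d)
    # Row-wise softmax over the n x n scores: O(n^2) time and space.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of values: (n, n) @ (n, d) -> O(n^2 * d) time again.
    return weights @ V

if __name__ == "__main__":
    n, d = 1024, 64
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    print(scaled_dot_product_attention(Q, K, V).shape)  # (1024, 64)
```

The quadratic cost is visible in the explicit \(n \times n\) score matrix: doubling the sequence length quadruples both the matmul work and the memory needed to hold the attention weights.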
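For the chunking/segmenting class of methods mentioned in the last item, the following is a simplified, hedged sketch of block-local attention: each chunk of `w` queries attends only to keys and values in its own chunk, so cost drops from O(n^2 * d) to O(n * w * d). This is an illustration of the general idea only (no cross-chunk overlap or global tokens), not a specific published method.

```python
# Block-local ("chunked") attention sketch: the largest score matrix held in
# memory is w x w instead of n x n. Assumes n is divisible by w for brevity.
import numpy as np

def block_local_attention(Q, K, V, w):
    """Q, K, V: (n, d) arrays with n divisible by w. Returns an (n, d) output."""
    n, d = Q.shape
    out = np.empty_like(Q)
    for start in range(0, n, w):
        q = Q[start:start + w]            # (w, d) queries for this chunk
        k = K[start:start + w]            # (w, d) keys restricted to the chunk
        v = V[start:start + w]            # (w, d) values restricted to the chunk
        scores = q @ k.T / np.sqrt(d)     # (w, w) instead of (n, n)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[start:start + w] = weights @ v
    return out
```

The trade-off to discuss: the per-token cost becomes independent of \(n\), but tokens in different chunks cannot attend to each other directly, so long-range dependencies must be recovered by other means (overlapping windows, global tokens, or stacking layers).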