What I Read: Attention sinks.
https://publish.obsidian.md/the-tensor-throne/Transformers+as+GNNs/Attention+sinks+from+the+graph+perspective
Attention sinks from the graph perspective
Francesco Pappone
August 24, 2025
"...attention sinks are easy to describe: when trained, decoder-only transformer models tend to allocate a disproportionate amount of attention to the first few tokens..."