What I Read: Attention sinks.
https://publish.obsidian.md/the-tensor-throne/Transformers+as+GNNs/Attention+sinks+from+the+graph+perspective
Attention sinks from the graph perspective
Francesco Pappone
August 24, 2025
"...attention sinks are easy to describe: when trained, decoder-only transformer models tend to allocate a disproportionate amount of attention to the first few tokens..."