https://mlbenchmarks.org/11-evaluating-language-models.html
Evaluating language models
Moritz Hardt
"Applying tune-before-test, rankings enjoy greater agreement across different benchmarks.... tune-before-test also aligns perplexity rankings with downstream task benchmarks."