All the articles with the tag "evals".
A practical look at the MyPaperPop eval suite: what it measures, and why its first goal was safer product behavior rather than better AI.