From: The Scaling Law of Evaluation Failure (arXiv:2605.11205)

Evaluation Failure Calculator

Compare simple-average vs IRT rankings on your benchmark data

Input Matrix

Paste a matrix whose first column holds model names and whose first row holds item or benchmark names. Leave a cell empty to mark missing data. Scores should lie between 0 and 1, or between 0 and 100; the scale is auto-detected.
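For readers who want to check their data before pasting, the input format can be sketched in Python. This is a minimal illustration, not the calculator's own code: the function names `parse_matrix` and `simple_average_ranking`, the comma delimiter, and the scale rule (any value above 1 means the matrix is on a 0-100 scale) are all assumptions. It covers only the simple-average side of the comparison, averaging each model over the cells it actually has.

```python
import csv
import io
from typing import Optional


def parse_matrix(text: str, delimiter: str = ","):
    """Parse a pasted matrix: first row = item names, first column = model names.

    Empty cells become None. The delimiter is an assumption; pass "\t" for
    tab-separated data copied from a spreadsheet.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # Skip blank or whitespace-only lines.
    rows = [r for r in reader if any(cell.strip() for cell in r)]
    items = [name.strip() for name in rows[0][1:]]
    models: list[str] = []
    scores: list[list[Optional[float]]] = []
    for row in rows[1:]:
        models.append(row[0].strip())
        cells = [c.strip() for c in row[1:]]
        # Pad short rows so every model has one cell per item.
        cells += [""] * (len(items) - len(cells))
        scores.append([float(c) if c else None for c in cells[: len(items)]])
    # Assumed auto-detection rule: any value above 1 implies a 0-100 scale.
    observed = [v for row in scores for v in row if v is not None]
    if observed and max(observed) > 1.0:
        scores = [[None if v is None else v / 100.0 for v in row] for row in scores]
    return models, items, scores


def simple_average_ranking(models, scores):
    """Rank models by their mean over observed cells only, best first."""
    means = {}
    for model, row in zip(models, scores):
        seen = [v for v in row if v is not None]
        if seen:  # models with no observed scores are left unranked
            means[model] = sum(seen) / len(seen)
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical example data on a 0-100 scale with two missing cells.
pasted = """model,item_a,item_b,item_c
alpha,80,60,
beta,50,,70
gamma,40,30,20
"""

models, items, scores = parse_matrix(pasted)
for name, mean in simple_average_ranking(models, scores):
    print(f"{name}: {mean:.3f}")
```

In this example each model is averaged over a different set of items, so the means are not measured on the same footing. That gap is where a simple-average ranking and an IRT ranking can disagree.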

Your data never leaves your browser. All computation runs locally, and nothing is sent to any server.