Evaluation Failure Calculator

Input Matrix

Paste a matrix: the first column holds model names and the first row holds item/benchmark names. Empty cells are treated as missing data. Scores should lie between 0 and 1 (or between 0 and 100; the scale is auto-detected).
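
The input format above can be sketched as follows. This is a minimal illustration, not the calculator's actual code: the function names `parse_matrix` and `normalize_scale` are hypothetical, the tab separator is assumed from typical spreadsheet pastes, and the scale rule (any value above 1 means the matrix is on a 0 to 100 scale) is one plausible auto-detection heuristic rather than the one the tool necessarily uses.

```python
def parse_matrix(text, sep="\t"):
    """Parse a pasted matrix into (items, models, scores).

    First row = item names, first column = model names.
    Empty cells become None (missing data).
    """
    rows = [line.split(sep) for line in text.strip("\n").splitlines() if line.strip()]
    items = [h.strip() for h in rows[0][1:]]
    models, scores = [], []
    for row in rows[1:]:
        models.append(row[0].strip())
        # Pad short rows so every model has one cell per item.
        cells = row[1:] + [""] * (len(items) - len(row[1:]))
        scores.append([float(c) if c.strip() else None for c in cells[:len(items)]])
    return items, models, scores


def normalize_scale(scores):
    """Assumed heuristic: if any observed value exceeds 1, divide everything by 100."""
    observed = [v for row in scores for v in row if v is not None]
    if observed and max(observed) > 1.0:
        return [[None if v is None else v / 100.0 for v in row] for row in scores]
    return scores


pasted = "\tq1\tq2\nm1\t80\t\nm2\t55\t90\n"
items, models, scores = parse_matrix(pasted)
scores = normalize_scale(scores)
```

One weakness of this heuristic is that a 0 to 100 matrix whose values all happen to be 1 or below would be left unscaled.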

Your data never leaves your browser. All computation runs locally, and nothing is sent to any server.

Ranking Comparison

Default sort is by IRT Rank ascending. Rows with |Δ Rank| > 2 are highlighted.

Rank	Model	Avg Score	Avg Rank	IRT θ	IRT Rank	Δ Rank
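
The comparison in this table can be sketched as below. It is an illustration under stated assumptions, not the calculator's implementation: "Avg Rank" is taken to be a model's rank by average score, Δ Rank is taken to be Avg Rank minus IRT Rank (the page does not state the sign convention), the IRT θ values are supplied as inputs rather than fitted, and the names `rank_descending` and `compare_rankings` are hypothetical.

```python
def rank_descending(values):
    """Rank 1 = largest value. Ties are broken by input order (a simplification)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks


def compare_rankings(models, avg_scores, thetas, threshold=2):
    """Build table rows sorted by IRT rank ascending, flagging |delta| > threshold."""
    avg_rank = rank_descending(avg_scores)
    irt_rank = rank_descending(thetas)
    rows = []
    for m, s, t, ar, ir in zip(models, avg_scores, thetas, avg_rank, irt_rank):
        delta = ar - ir  # assumed sign convention: Avg Rank minus IRT Rank
        rows.append({
            "model": m, "avg_score": s, "avg_rank": ar,
            "theta": t, "irt_rank": ir,
            "delta_rank": delta, "highlight": abs(delta) > threshold,
        })
    return sorted(rows, key=lambda r: r["irt_rank"])


# Made-up numbers, chosen so that the two rankings disagree strongly.
table = compare_rankings(
    ["A", "B", "C", "D"],
    avg_scores=[0.9, 0.8, 0.7, 0.6],
    thetas=[-1.0, 0.5, 1.0, 2.0],
)
```

With these made-up inputs, models A and D differ by three places between the two rankings and would be highlighted.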

Rank Scatter

Average Rank plotted against IRT Rank. Points on the diagonal are models that both methods rank the same.

Item Analysis

Sorted by discrimination, highest first.

Item	Difficulty (b)	Discrimination (a)	Coverage	Mean Score
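
The Difficulty (b) and Discrimination (a) columns match the parameters of a two-parameter logistic (2PL) IRT model. The page does not say which model the calculator fits, so the sketch below assumes the standard 2PL form purely to show how a and b are read; the names `expected_score` and `sort_by_discrimination` are hypothetical.

```python
import math


def expected_score(theta, a, b):
    """Standard 2PL item response function (assumed model):
    P = 1 / (1 + exp(-a * (theta - b)))
    theta = model ability, b = item difficulty, a = item discrimination.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def sort_by_discrimination(items):
    """Order item records by discrimination 'a', highest first."""
    return sorted(items, key=lambda item: item["a"], reverse=True)


# Made-up item parameters for illustration only.
items = [
    {"item": "q1", "b": 0.0, "a": 0.5},
    {"item": "q2", "b": 1.0, "a": 2.0},
    {"item": "q3", "b": -0.5, "a": 1.2},
]
ordered = sort_by_discrimination(items)
```

Under this form, a model whose ability equals an item's difficulty has an expected score of 0.5 on that item, and a larger a makes the expected score change more sharply as ability moves away from b, which is why high-discrimination items separate models most clearly.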