Similar Items: Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling