Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling