Similar Items: Modeling Language and Vision at Human Scales