Similar Items: Audio-Visual Intelligence in Large Foundation Models