Audio-Visual Intelligence in Large Foundation Models