Direct and indirect multimodal few-shot learning of speech and images