Harnessing multimodal large language models to interpret ecological momentary assessment-generated caregiving photographs