Orchestrating Generative Models for Synthesizing Domain-Specific Vision–Language Instructions