Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation