Causal Inference and Text-Assisted Localization for Medical Video Localization