Double-Attention Transformer for Cross-Modal Image Captioning: Enhancing Visual–Linguistic Alignment on Low-Resource Datasets