Transformer-based fusion of acoustic and textual cues with proportional augmentation for emotion recognition